Introduction

The objective of this study is to determine how the occupancy levels of parking lots in the United States relate to the month, day and hour of the analyzed period, using a k-means algorithm. To achieve this, I’ll use the R programming language and a dataset of 35,717 records collected from October to December 2016.

K-means is a clustering method. The term “k-means” was first used by James MacQueen in 1967, although the idea goes back to Hugo Steinhaus in 1957.

The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse code modulation, although it was not published outside of Bell Labs until 1982.

Case Study

The general objective of this research is to determine how the occupancy levels of parking lots in the United States relate to the months, days and hours of the period analyzed, through the development of a k-means algorithm.

The specific objectives are as follows:

Establish an exploratory analysis of the data through calculations of means, medians and other measures of central tendency.
Develop and implement the K-means algorithm.
Determine the correlation that exists in the occupancy levels with respect to the months, days and hours of the period analyzed.

Methodology

Clustering is a technique for finding and classifying K groups of data (clusters). Thus, elements that share similar characteristics will be together in the same group, separated from other groups with which they do not share characteristics.

To find out if the data are similar or different, the K-means algorithm uses the distance between the data. Observations that are similar will have a smaller distance between them. In general, the Euclidean distance is used as a measure, although other functions can also be used.
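
As a minimal illustration of the distance comparison described above, the sketch below computes Euclidean distances between three made-up observations with base R's dist function. The values are hypothetical and are not taken from the parking dataset.

```r
# Three hypothetical observations described by two variables
obs <- rbind(a = c(1, 2),
             b = c(2, 3),
             c = c(8, 9))

# Pairwise Euclidean distances (the default method of dist)
d <- as.matrix(dist(obs, method = "euclidean"))
print(round(d, 3))

# a and b are closer to each other than a is to c,
# so a distance-based method would tend to group a with b
d["a", "b"] < d["a", "c"]
```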

Clustering algorithms are considered unsupervised learning algorithms. This type of algorithm looks for patterns in the data without a specific prediction target (there is no dependent variable). Instead of input-output pairs, the data consists only of inputs: the multiple variables describing each observation.

The K-means algorithm takes as input the number of groups k into which the population will be segmented. The algorithm first places k random points (centroids). It then assigns each sample to the centroid closest to it. Next, each centroid is moved to the mean of the samples assigned to it.

This generates a new sample assignment, as some samples are now closer to another centroid. The process is repeated iteratively until the assignment no longer changes when the centroids are moved. The final result is a partition that locally minimizes the within-group sum of squared distances and, equivalently, maximizes the between-group sum of squares. The result depends on the initial centroids, so it is not guaranteed to be the global optimum.
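
The iterative procedure can be sketched in a few lines of R. This is a simplified illustration written for this explanation, not the implementation behind the built-in kmeans function. The function name simple_kmeans and the toy data are hypothetical.

```r
# Minimal sketch of the assign/update loop (illustrative only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen observations as centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Index of the nearest centroid for each observation
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when the assignment no longer changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Move each centroid to the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Two well-separated hypothetical groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
res <- simple_kmeans(toy, k = 2)
table(res$cluster)
```

On data this cleanly separated the loop settles after a few passes. On real data the outcome can vary with the random starting centroids, which is why a seed is set before clustering.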

This type of unsupervised learning algorithm is useful for exploring, describing and summarizing data in a different way. Clustering the data can help us to confirm (or reject) some kind of previous classification. It can also help us to discover patterns and relationships we were unaware of.

The dataset to be analyzed consists of the following characteristics:

35,717 tuples or rows. Four columns with the following data:
- SystemCode: Indicates the code associated with the parking lot.
- Capacity: Indicates the total number of stalls available in the parking lot.
- Occupancy: Indicates the number of parking spaces that are occupied at the time the report is issued.
- LastUpdated: Indicates the exact date on which the tuple data was recorded.

In order to manage the data more efficiently when implementing the k-means algorithm, the LastUpdated field was divided into four fields:

Year: Year of tuple data record.
Month: Month of the tuple data record.
Day: Day of the tuple data record.
Time: Time of the tuple data record.

Exploratory Analysis
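
The split of LastUpdated into Year, Month, Day and Time could be done along the following lines. This is only a sketch: the timestamp format and the sample values are assumptions, and the HHMM encoding of Time is inferred from values such as 730 and 1634 in the summaries.

```r
# Hypothetical timestamps; the real LastUpdated format is assumed, not confirmed
raw <- data.frame(LastUpdated = c("2016-10-04 07:59:42", "2016-12-19 16:30:35"),
                  stringsAsFactors = FALSE)

# Parse the text into date-time values
ts <- as.POSIXct(raw$LastUpdated, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

raw$Year  <- as.numeric(format(ts, "%Y"))
raw$Month <- as.numeric(format(ts, "%m"))
raw$Day   <- as.numeric(format(ts, "%d"))
# Time encoded as HHMM, e.g. 07:59 becomes 759
raw$Time  <- as.numeric(format(ts, "%H%M"))
raw
```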

Let’s import the cluster and fpc libraries, and create the dataset.

In [48]:

library(cluster)
library(fpc)

set.seed(500)
dataset <- read.csv("../input/dataset/parking-clean-data.csv")

The minimum value, first quartile, median, mean, third quartile and maximum value are obtained for each of the five columns to be analyzed, and then box plots are generated with this information.

We can observe that the parking lots analyzed have a capacity that varies from 220 stalls to 4675 stalls, with a mean of 1398 stalls. Their occupancy varies from 0 stalls to 4327 stalls, with an average of 642.2 stalls.

Regarding the period analyzed, it ranges from month 10 to month 12 of 2016, with an average of 10.88. Regarding the days analyzed, a complete run is made from days 1 to 31, with an average of 15.16. The analyzed hours range from 7:30 to 16:34, with a mean of 12:05.
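
Because Time is stored as an HHMM number (730 for 7:30, 1634 for 16:34), an arithmetic mean of the raw codes only approximates the mean clock time. The sketch below shows the difference on two hypothetical values; the helper names hhmm_to_minutes and minutes_to_hhmm are made up for this illustration.

```r
# Convert an HHMM-coded time to minutes since midnight
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + (t %% 100)

# Convert minutes since midnight back to an HHMM code
minutes_to_hhmm <- function(m) (m %/% 60) * 100 + (m %% 60)

times <- c(759, 801)   # 07:59 and 08:01 (hypothetical values)

raw_mean   <- mean(times)                                   # 780, not a valid clock time
clock_mean <- minutes_to_hhmm(mean(hhmm_to_minutes(times))) # 800, i.e. 08:00
c(raw_mean = raw_mean, clock_mean = clock_mean)
```

For values spread over a whole day the distortion is usually small, so the means reported here remain a reasonable summary.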

The Capacity and Occupancy fields have outliers that may affect the effectiveness of the clustering process. However, they will not be adjusted yet because of their importance in this analysis. After the clustering process, outliers in the created clusters will be removed in order to generate more accurate conclusions.
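
As a small sketch of how box-plot outliers are identified, base R's boxplot.stats function applies the same 1.5 × IQR whisker rule that boxplot uses by default. The capacities below are hypothetical values chosen for illustration.

```r
# Hypothetical capacities with one unusually large value
capacity <- c(220, 485, 500, 690, 720, 849, 1010, 1200, 4675)

# boxplot.stats returns the five-number summary and the points beyond the whiskers
stats <- boxplot.stats(capacity)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values that would be drawn as individual points
```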

In [49]:

# Cast to numeric columns
dataset$Capacity <- as.numeric(dataset$Capacity)
dataset$Occupancy <- as.numeric(dataset$Occupancy)
dataset$Month <- as.numeric(dataset$Month)
dataset$Day <- as.numeric(dataset$Day)
dataset$Time <- as.numeric(dataset$Time)

In [50]:

# Drop rows that contain NA values
dataset <- na.omit(dataset)

# Create new dataset with the columns that will be used
mydata <- dataset[,c(2,3,5:7)]

# Get a summary of the dataset
summary(mydata)
boxplot(mydata)

    Capacity      Occupancy          Month            Day             Time     
 Min.   : 220   Min.   :   0.0   Min.   :10.00   Min.   : 1.00   Min.   : 730  
 1st Qu.: 500   1st Qu.: 210.0   1st Qu.:10.00   1st Qu.: 8.00   1st Qu.:1000  
 Median : 849   Median : 446.0   Median :11.00   Median :15.00   Median :1204  
 Mean   :1398   Mean   : 642.2   Mean   :10.88   Mean   :15.16   Mean   :1205  
 3rd Qu.:2009   3rd Qu.: 798.0   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1429  
 Max.   :4675   Max.   :4327.0   Max.   :12.00   Max.   :31.00   Max.   :1634

In [51]:

# Capacity boxplot
boxplot(mydata$Capacity,
        main = "Capacity",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

In [52]:

# Occupancy boxplot
boxplot(mydata$Occupancy,
        main = "Occupancy",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

In [53]:

# Month boxplot
boxplot(mydata$Month,
        main = "Month",
        xlab = "Time",
        ylab = "Months",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

Warning message in bxp(list(stats = structure(c(10, 10, 11, 11, 12), .Dim = c(5L, :
“some notches went outside hinges ('box'): maybe set notch=FALSE”

In [54]:

# Day boxplot
boxplot(mydata$Day,
        main = "Day",
        xlab = "Time",
        ylab = "Days of the month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

In [55]:

# Time boxplot
boxplot(mydata$Time,
        main = "Hour",
        xlab = "Time",
        ylab = "Hours of the day",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

K-means Algorithm

The variables in the dataset are at different scales, so it is necessary to scale the dataset to maintain uniformity as a step prior to running the K-means algorithm.
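
To illustrate what scaling does, the sketch below standardizes two hypothetical columns with base R's scale function, which by default subtracts each column's mean and divides by its standard deviation. The toy values are not the real data.

```r
# Two hypothetical columns on very different scales
toy <- data.frame(Capacity = c(220, 500, 849, 2009, 4675),
                  Month    = c(10, 10, 11, 11, 12))

scaled_toy <- scale(toy)   # center each column, then divide by its sd

round(colMeans(scaled_toy), 10)   # both columns now have mean 0
apply(scaled_toy, 2, sd)          # and standard deviation 1
```

Without this step, a variable measured in thousands (such as Capacity) would dominate the Euclidean distance over a variable that only ranges from 10 to 12 (such as Month).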

To choose the number of clusters (k), the elbow method is used: the total sum of squares of the scaled data is stored as the first element of an array (the k = 1 case), and k-means is then run for k = 2 to 15. For each k, the total within-cluster sum of squares (the sum of withinss) is recorded. The results are then plotted and the k value at which the slope of the curve changes most is selected.

In [56]:

# The dataset variables are at different scales.
# To maintain uniformity, the columns are scaled
scaled_data <- scale(mydata[,1:5])

# Total sum of squares, i.e. the within-cluster sum of squares for k = 1
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))

# For k = 2 to 15, run k-means (at most 15 iterations each) and
# record the total within-cluster sum of squares (sum of withinss)
for (i in 2:15) {
  wss[i] <- sum(kmeans(scaled_data, centers = i, iter.max = 15)$withinss)
}

# Plot the within-cluster sum of squares against k
plot(1:15,wss,type="b",main="15 clusters",xlab="# of cluster",ylab="sum of squares")

From this graph, we can observe that the slope has a more significant change when the number of clusters is equal to 3, which is why this value will be taken as the number of clusters to be used.

After applying the kmeans function, properties of the generated clusters are evaluated:

Centers: Matrix with the mean of each variable for each cluster.
Totss: Total sum of squares.
Withinss: Within-cluster sum of squares, one value per cluster.
Tot.withinss: Total within-cluster sum of squares.
Betweenss: Between-cluster sum of squares.
Size: Number of points in each cluster.

In [57]:

# The slope varies mainly in the third iteration, 
# so 3 is considered as the optimal number of clusters.
fit <- kmeans(scaled_data, 3)

fit$centers
fit$totss
fit$withinss
fit$tot.withinss
fit$betweenss
fit$size

	Capacity	Occupancy	Month	Day	Time
1	-0.4041041	-0.3906482	-0.679259247	0.556198370	-0.07882166
2	-0.3737569	-0.3354071	0.870854168	-0.714267704	-0.04301146
3	1.6444271	1.5418053	-0.005875559	0.007001652	0.26551874

178580

44049.45510661
30017.6406344365
32196.509518354

106263.605259401

72316.3947405994

16187
12672
6858

The within-cluster sums of squares are 44049.46, 30017.64 and 32196.51 (106263.61 in total), while the between-cluster sum of squares is 72316.39.
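
The figures reported by kmeans can be cross-checked with a few lines of arithmetic. The sketch below re-enters the printed values by hand and relies on the fact that each scaled column has variance 1, so the total sum of squares equals (n - 1) times the number of columns.

```r
n <- 35717                      # rows in the dataset (sum of the cluster sizes)
p <- 5                          # scaled variables used in the clustering
withinss  <- c(44049.45510661, 30017.6406344365, 32196.509518354)
betweenss <- 72316.3947405994

totss <- (n - 1) * p            # 178580, matching fit$totss
decomposition_ok <- isTRUE(all.equal(sum(withinss) + betweenss, totss))
decomposition_ok                # TRUE: within + between = total

# Share of the total variation explained by the three clusters
explained <- round(100 * betweenss / totss, 1)
explained                       # 40.5
```

A share of about 40.5% suggests that the three clusters capture a substantial part, but far from all, of the variation in the scaled data.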

Through the plotcluster function, a graph with the clustering result is obtained, and subsequently a matrix with the means of each cluster in each dimension is obtained to evaluate how different the clusters obtained are.

The first cluster has a size of 16187 points (red in the graph), the second is 12672 points (black in the graph) and the third is 6858 points (green in the graph).

From the graph obtained, it can be deduced that there are three large differentiated groups into which the data can be categorized. The groups are explained in detail below.

In [58]:

# Interpret patterns
plotcluster(scaled_data,fit$cluster)
points(fit$centers,col=1:5,pch=16)

# Mean of each variable in each cluster
mean_data <- dataset[,c(2,3,5:7)]
mean_data <- data.frame(mean_data,fit$cluster)
mean_ds <- aggregate(mean_data[,1:5],by = list(fit$cluster),FUN = mean)
mean_ds

Group.1	Capacity	Occupancy	Month	Day	Time
<int>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	920.9793	385.5916	10.37036	19.773522	1184.315
2	956.7685	421.8825	11.53614	9.239189	1193.640
3	3336.8672	1655.1260	10.87679	15.219743	1273.979

The results of the mean matrix are as follows:

Group 1:
- Capacity: 920.9793
- Occupancy: 385.5916
- Month: 10.37036
- Day: 19.773522
- Time: 1184.315
Group 2:
- Capacity: 956.7685
- Occupancy: 421.8825
- Month: 11.53614
- Day: 9.239189
- Time: 1193.640
Group 3:
- Capacity: 3336.8672
- Occupancy: 1655.1260
- Month: 10.87679
- Day: 15.219743
- Time: 1273.979

These data can be synthesized as follows:

Group 1: is composed of low and medium capacity parking lots, mainly during the month of October and part of the month of November. It consists of 16187 records (45.32%).
Group 2: is composed of low and medium capacity parking lots during the months of November and December. It consists of 12672 records (35.48%).
Group 3: is composed of high capacity parking lots during the months of October, November and December. It consists of 6858 records (19.20%).
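
The group shares follow directly from the cluster sizes. A short check, with the sizes re-entered by hand from the fit$size output and the shares rounded to two decimals:

```r
sizes <- c(16187, 12672, 6858)   # cluster sizes reported by fit$size
total <- sum(sizes)              # 35717 records
shares <- round(100 * sizes / total, 2)
shares                           # 45.32 35.48 19.20
```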

In [59]:

# Assign a variable for each cluster
cluster1 = subset(mean_data, fit.cluster == 1)
cluster2 = subset(mean_data, fit.cluster == 2)
cluster3 = subset(mean_data, fit.cluster == 3)

# summary of the clusters
summary(cluster1)

    Capacity      Occupancy          Month            Day             Time     
 Min.   : 220   Min.   :   0.0   Min.   :10.00   Min.   : 4.00   Min.   : 730  
 1st Qu.: 485   1st Qu.: 175.0   1st Qu.:10.00   1st Qu.:15.00   1st Qu.: 932  
 Median : 720   Median : 334.0   Median :10.00   Median :21.00   Median :1200  
 Mean   : 921   Mean   : 385.6   Mean   :10.37   Mean   :19.77   Mean   :1184  
 3rd Qu.:1194   3rd Qu.: 560.0   3rd Qu.:11.00   3rd Qu.:26.00   3rd Qu.:1426  
 Max.   :3883   Max.   :1451.0   Max.   :11.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :1   
 1st Qu.:1   
 Median :1   
 Mean   :1   
 3rd Qu.:1   
 Max.   :1

In [60]:

boxplot(cluster1)

In [61]:

summary(cluster2)

    Capacity        Occupancy          Month            Day        
 Min.   : 220.0   Min.   :   0.0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 485.0   1st Qu.: 187.0   1st Qu.:11.00   1st Qu.: 5.000  
 Median : 720.0   Median : 364.5   Median :12.00   Median : 9.000  
 Mean   : 956.8   Mean   : 421.9   Mean   :11.54   Mean   : 9.239  
 3rd Qu.:1200.0   3rd Qu.: 605.0   3rd Qu.:12.00   3rd Qu.:13.000  
 Max.   :3103.0   Max.   :1618.0   Max.   :12.00   Max.   :19.000  
      Time       fit.cluster
 Min.   : 732   Min.   :2   
 1st Qu.: 956   1st Qu.:2   
 Median :1200   Median :2   
 Mean   :1194   Mean   :2   
 3rd Qu.:1426   3rd Qu.:2   
 Max.   :1634   Max.   :2

In [62]:

boxplot(cluster2)

To simplify the analysis, outliers in the Capacity dimension are removed from clusters 1 and 2: only tuples whose Capacity is less than 2000 stalls are retained. This removes 1372 records from cluster 1 and 1263 records from cluster 2, for a total of 2635 records (7.38% of the total data).

The results of the adjustment are presented below:

In [63]:

# Determine number of outliers in clusters 1 and 2
length(cluster1$Capacity[cluster1$Capacity>2000]) # 1372
length(cluster2$Capacity[cluster2$Capacity>2000]) # 1263

# Remove outliers from clusters 1 and 2
cluster1 = subset(cluster1, Capacity<2000)
cluster2 = subset(cluster2, Capacity<2000)

1372

1263

In [64]:

summary(cluster1)
summary(cluster2)
summary(cluster3)

    Capacity        Occupancy          Month            Day       
 Min.   : 220.0   Min.   :   0.0   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 480.0   1st Qu.: 166.0   1st Qu.:10.00   1st Qu.:15.00  
 Median : 690.0   Median : 310.0   Median :10.00   Median :20.00  
 Mean   : 759.6   Mean   : 363.5   Mean   :10.38   Mean   :19.66  
 3rd Qu.:1010.0   3rd Qu.: 532.0   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1920.0   Max.   :1412.0   Max.   :11.00   Max.   :31.00  
      Time       fit.cluster
 Min.   : 730   Min.   :1   
 1st Qu.: 959   1st Qu.:1   
 Median :1203   Median :1   
 Mean   :1199   Mean   :1   
 3rd Qu.:1429   3rd Qu.:1   
 Max.   :1634   Max.   :1

    Capacity        Occupancy          Month            Day        
 Min.   : 220.0   Min.   :   0.0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 485.0   1st Qu.: 175.0   1st Qu.:11.00   1st Qu.: 5.000  
 Median : 690.0   Median : 332.0   Median :12.00   Median : 9.000  
 Mean   : 770.5   Mean   : 392.9   Mean   :11.53   Mean   : 9.335  
 3rd Qu.:1010.0   3rd Qu.: 568.0   3rd Qu.:12.00   3rd Qu.:13.000  
 Max.   :1920.0   Max.   :1586.0   Max.   :12.00   Max.   :19.000  
      Time       fit.cluster
 Min.   : 732   Min.   :2   
 1st Qu.:1000   1st Qu.:2   
 Median :1225   Median :2   
 Mean   :1207   Mean   :2   
 3rd Qu.:1429   3rd Qu.:2   
 Max.   :1634   Max.   :2

    Capacity      Occupancy        Month            Day             Time     
 Min.   :1920   Min.   : 385   Min.   :10.00   Min.   : 1.00   Min.   : 755  
 1st Qu.:2937   1st Qu.:1102   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1101  
 Median :3103   Median :1363   Median :11.00   Median :15.00   Median :1303  
 Mean   :3337   Mean   :1655   Mean   :10.88   Mean   :15.22   Mean   :1274  
 3rd Qu.:3883   3rd Qu.:2194   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1459  
 Max.   :4675   Max.   :4327   Max.   :12.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

In [65]:

# cluster 1
# Capacity boxplot
boxplot(cluster1$Capacity,
        main = "Capacity",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Occupancy boxplot
boxplot(cluster1$Occupancy,
        main = "Occupancy",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Month boxplot
boxplot(cluster1$Month,
        main = "Month",
        xlab = "Time",
        ylab = "Months",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Day boxplot
boxplot(cluster1$Day,
        main = "Day",
        xlab = "Time",
        ylab = "Days of month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Time boxplot
boxplot(cluster1$Time,
        main = "Hour",
        xlab = "Time",
        ylab = "Hours of Day",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

Warning message in bxp(list(stats = structure(c(10, 10, 10, 11, 11), .Dim = c(5L, :
“some notches went outside hinges ('box'): maybe set notch=FALSE”

In [66]:

# cluster 2
# Capacity boxplot
boxplot(cluster2$Capacity,
        main = "Capacity",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Occupancy boxplot
boxplot(cluster2$Occupancy,
        main = "Occupancy",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Month boxplot
boxplot(cluster2$Month,
        main = "Month",
        xlab = "Time",
        ylab = "Months",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Day boxplot
boxplot(cluster2$Day,
        main = "Day",
        xlab = "Time",
        ylab = "Days of month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Time boxplot
boxplot(cluster2$Time,
        main = "Hour",
        xlab = "Time",
        ylab = "Hours of Day",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

Warning message in bxp(list(stats = structure(c(11, 11, 12, 12, 12), .Dim = c(5L, :
“some notches went outside hinges ('box'): maybe set notch=FALSE”

In [67]:

# cluster 3
# Capacity boxplot
boxplot(cluster3$Capacity,
        main = "Capacity",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Occupancy boxplot
boxplot(cluster3$Occupancy,
        main = "Occupancy",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Month boxplot
boxplot(cluster3$Month,
        main = "Month",
        xlab = "Time",
        ylab = "Months",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Day boxplot
boxplot(cluster3$Day,
        main = "Day",
        xlab = "Time",
        ylab = "Days of month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Time boxplot
boxplot(cluster3$Time,
        main = "Hour",
        xlab = "Time",
        ylab = "Hours of Day",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

Warning message in bxp(list(stats = structure(c(10, 10, 11, 11, 12), .Dim = c(5L, :
“some notches went outside hinges ('box'): maybe set notch=FALSE”

Results and Conclusions

Based on the summary of the clusters and the box plots obtained, the following conclusions can be drawn.

General analysis of the clusters

Those parking lots with low and medium capacity (between 220 and 1920 stalls, with an average of 759.6) will have an average occupancy of 363.5 stalls during the month of October, which represents an occupancy of 47.85% of the average capacity. This occupancy increases during the months of November and December to 392.9 stalls out of an average capacity of 770.5, which constitutes 51% occupancy.

High capacity parking lots (between 1920 and 4675 stalls, with an average of 3337) will have an average occupancy of 1655, representing 49.6%.
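
The occupancy percentages quoted above are the ratio of mean occupancy to mean capacity in each cluster. The sketch below reproduces them from the summary values, re-entered by hand. Note that a ratio of means is not the same as the mean of per-record occupancy ratios, which this sketch does not compute.

```r
mean_capacity  <- c(cluster1 = 759.6, cluster2 = 770.5, cluster3 = 3337)
mean_occupancy <- c(cluster1 = 363.5, cluster2 = 392.9, cluster3 = 1655)

# Mean occupancy as a percentage of mean capacity
occupancy_pct <- round(100 * mean_occupancy / mean_capacity, 2)
occupancy_pct   # 47.85 50.99 49.60
```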

Cluster analysis

Cluster 1

The parking lots with high occupancy (value above the third quartile, which is 532 stalls) have a similar occupancy during the days of the period analyzed.

Parking lots with intermediate occupancy (value higher than the first quartile, which is 166 stalls and lower than the third quartile, which is 532 stalls) have similar occupancy during the days of the period analyzed.

The parking lots with low occupancy (value lower than the first quartile which is 166) during the days of October and November are evenly distributed.

During the first hours of the day (before the first quartile, which is 9:59) occupancy is low, with a mean of 227.5 stalls, and 75% of the records (up to the third quartile) reach at most 329 stalls, which is below the mean of this cluster (363.5 stalls).

During intermediate hours of the day (value above the first quartile 9:59 and below the third quartile 14:29) a high occupancy is recorded, with an average of 411.7 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 598 stalls.

During the last hours of the day (value above the third quartile which is 14:29) there is an intermediate occupancy, with an average of 405.2 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 567 stalls.
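
The three time bands can also be summarized in a single pass with cut and aggregate instead of repeated subsets. The sketch below uses a small hypothetical data frame in place of the real cluster. Unlike strict comparisons such as Time < 959, cut with its default settings includes the upper edge of each band.

```r
# Hypothetical stand-in for one cluster; values are illustrative only
toy <- data.frame(Time      = c(745, 830, 1015, 1200, 1330, 1445, 1600),
                  Occupancy = c(120, 200, 380, 450, 430, 410, 390))

# Band edges taken from the quartiles quoted in the text (9:59 and 14:29)
toy$Band <- cut(toy$Time,
                breaks = c(-Inf, 959, 1429, Inf),
                labels = c("early", "intermediate", "late"))

# Mean occupancy per band
band_means <- aggregate(Occupancy ~ Band, data = toy, FUN = mean)
band_means
```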

In [68]:

# Days
# High occupancy days
test = subset(cluster1, Occupancy > 532)
summary(test)
boxplot(test$Day,
        main = "High Occupancy",
        xlab = "Time",
        ylab = "Days of the month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Intermediate occupancy days
test = subset(cluster1, Occupancy < 532 & Occupancy > 166)
summary(test)
boxplot(test$Day,
        main = "Intermediate Occupancy",
        xlab = "Time",
        ylab = "Days of the month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Low occupancy days
test = subset(cluster1, Occupancy < 166)
summary(test)
boxplot(test$Day,
        main = "Low Occupancy",
        xlab = "Time",
        ylab = "Days of the month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Hours
# Early Hours
test = subset(cluster1, Time < 959)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the early hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Intermediate Hours
test = subset(cluster1, Time > 959 & Time < 1429)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the intermediate hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Late hours
test = subset(cluster1, Time > 1429)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the last hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

    Capacity      Occupancy          Month            Day             Time     
 Min.   : 577   Min.   : 533.0   Min.   :10.00   Min.   : 4.00   Min.   : 755  
 1st Qu.: 720   1st Qu.: 597.0   1st Qu.:10.00   1st Qu.:15.00   1st Qu.:1130  
 Median :1010   Median : 673.0   Median :10.00   Median :21.00   Median :1304  
 Mean   :1066   Mean   : 722.4   Mean   :10.41   Mean   :19.98   Mean   :1289  
 3rd Qu.:1268   3rd Qu.: 796.0   3rd Qu.:11.00   3rd Qu.:26.00   3rd Qu.:1434  
 Max.   :1920   Max.   :1412.0   Max.   :11.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :1   
 1st Qu.:1   
 Median :1   
 Mean   :1   
 3rd Qu.:1   
 Max.   :1

    Capacity        Occupancy         Month            Day       
 Min.   : 220.0   Min.   :167.0   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 485.0   1st Qu.:224.0   1st Qu.:10.00   1st Qu.:14.00  
 Median : 600.0   Median :311.0   Median :10.00   Median :20.00  
 Mean   : 697.6   Mean   :320.5   Mean   :10.37   Mean   :19.42  
 3rd Qu.: 849.0   3rd Qu.:405.0   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1920.0   Max.   :531.0   Max.   :11.00   Max.   :31.00  
      Time       fit.cluster
 Min.   : 732   Min.   :1   
 1st Qu.:1001   1st Qu.:1   
 Median :1226   Median :1   
 Mean   :1212   Mean   :1   
 3rd Qu.:1430   3rd Qu.:1   
 Max.   :1634   Max.   :1

    Capacity        Occupancy          Month            Day       
 Min.   : 220.0   Min.   :  0.00   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 387.0   1st Qu.: 57.00   1st Qu.:10.00   1st Qu.:15.00  
 Median : 470.0   Median : 93.00   Median :10.00   Median :20.00  
 Mean   : 578.5   Mean   : 91.86   Mean   :10.36   Mean   :19.84  
 3rd Qu.: 788.0   3rd Qu.:130.00   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1268.0   Max.   :165.00   Max.   :11.00   Max.   :31.00  
      Time       fit.cluster
 Min.   : 730   Min.   :1   
 1st Qu.: 831   1st Qu.:1   
 Median :1001   Median :1   
 Mean   :1082   Mean   :1   
 3rd Qu.:1304   3rd Qu.:1   
 Max.   :1634   Max.   :1

    Capacity        Occupancy          Month            Day       
 Min.   : 220.0   Min.   :   0.0   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 480.0   1st Qu.:  78.0   1st Qu.:10.00   1st Qu.:15.00  
 Median : 690.0   Median : 180.0   Median :10.00   Median :20.00  
 Mean   : 778.8   Mean   : 227.5   Mean   :10.38   Mean   :19.54  
 3rd Qu.:1010.0   3rd Qu.: 329.0   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1920.0   Max.   :1262.0   Max.   :11.00   Max.   :31.00  
      Time        fit.cluster
 Min.   :730.0   Min.   :1   
 1st Qu.:804.0   1st Qu.:1   
 Median :856.0   Median :1   
 Mean   :858.8   Mean   :1   
 3rd Qu.:926.0   3rd Qu.:1   
 Max.   :958.0   Max.   :1

    Capacity        Occupancy          Month            Day       
 Min.   : 220.0   Min.   :   0.0   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 470.0   1st Qu.: 203.0   1st Qu.:10.00   1st Qu.:15.00  
 Median : 690.0   Median : 358.0   Median :10.00   Median :20.00  
 Mean   : 753.7   Mean   : 411.7   Mean   :10.39   Mean   :19.82  
 3rd Qu.:1010.0   3rd Qu.: 598.0   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1920.0   Max.   :1412.0   Max.   :11.00   Max.   :31.00  
      Time       fit.cluster
 Min.   :1000   Min.   :1   
 1st Qu.:1101   1st Qu.:1   
 Median :1204   Median :1   
 Mean   :1200   Mean   :1   
 3rd Qu.:1326   3rd Qu.:1   
 Max.   :1428   Max.   :1

    Capacity        Occupancy          Month            Day       
 Min.   : 220.0   Min.   :   0.0   Min.   :10.00   Min.   : 4.00  
 1st Qu.: 470.0   1st Qu.: 197.0   1st Qu.:10.00   1st Qu.:14.00  
 Median : 690.0   Median : 368.0   Median :10.00   Median :20.00  
 Mean   : 750.7   Mean   : 405.2   Mean   :10.38   Mean   :19.34  
 3rd Qu.:1010.0   3rd Qu.: 567.0   3rd Qu.:11.00   3rd Qu.:26.00  
 Max.   :1920.0   Max.   :1263.0   Max.   :11.00   Max.   :31.00  
      Time       fit.cluster
 Min.   :1430   Min.   :1   
 1st Qu.:1501   1st Qu.:1   
 Median :1531   Median :1   
 Mean   :1539   Mean   :1   
 3rd Qu.:1604   3rd Qu.:1   
 Max.   :1634   Max.   :1

Cluster 2

Parking lots with high occupancy (value higher than the third quartile, which is 568 stalls) show similar occupancy during the days of the period analyzed.

Parking lots with intermediate occupancy (value higher than the first quartile, which is 175 stalls, and lower than the third quartile, which is 568 stalls) have similar occupancy during the days of the period analyzed.

Parking lots with low occupancy (value lower than the first quartile, which is 175 stalls) show similar occupancy during the days of the period analyzed.

During the first hours of the day (before the first quartile, which is 10:00) occupancy is low, with an average of 245.6 stalls, and 75% of the records (up to the third quartile) reach at most 359.8 stalls, which is below the average of this cluster (392.9 stalls).

During intermediate hours of the day (after the first quartile, 10:00, and before the third quartile, 14:29) occupancy is high, with a mean of 451.2 stalls, and 75% of the records (up to the third quartile) reach at most 647 stalls.

During the last hours of the day (after the third quartile, which is 14:29) occupancy is intermediate, with an average of 430 stalls, and 75% of the records (up to the third quartile) reach at most 602 stalls.

In [69]:

# Days
# Days with high occupancy
test = subset(cluster2, Occupancy > 568)
summary(test)
boxplot(test$Day,
        main = "High Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Days with intermediate occupancy
test = subset(cluster2, Occupancy > 175 & Occupancy < 568)
summary(test)
boxplot(test$Day,
        main = "Intermediate Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Days with low occupancy
test = subset(cluster2, Occupancy < 175)
summary(test)
boxplot(test$Day,
        main = "Low Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Hours
# Early Hours
test = subset(cluster2, Time < 1000)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the early hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Intermediate hours
test = subset(cluster2, Time > 1000 & Time < 1429)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the intermediate hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Late hours
test = subset(cluster2, Time > 1429)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy at last hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

    Capacity      Occupancy          Month           Day              Time     
 Min.   : 600   Min.   : 569.0   Min.   :11.0   Min.   : 1.000   Min.   : 756  
 1st Qu.: 863   1st Qu.: 641.0   1st Qu.:11.0   1st Qu.: 5.000   1st Qu.:1130  
 Median :1010   Median : 728.0   Median :12.0   Median : 9.000   Median :1303  
 Mean   :1113   Mean   : 787.7   Mean   :11.6   Mean   : 9.244   Mean   :1294  
 3rd Qu.:1322   3rd Qu.: 860.0   3rd Qu.:12.0   3rd Qu.:14.000   3rd Qu.:1456  
 Max.   :1920   Max.   :1586.0   Max.   :12.0   Max.   :19.000   Max.   :1634  
  fit.cluster
 Min.   :2   
 1st Qu.:2   
 Median :2   
 Mean   :2   
 3rd Qu.:2   
 Max.   :2

    Capacity      Occupancy         Month            Day              Time     
 Min.   : 220   Min.   :176.0   Min.   :11.00   Min.   : 1.000   Min.   : 740  
 1st Qu.: 485   1st Qu.:238.0   1st Qu.:11.00   1st Qu.: 5.000   1st Qu.:1000  
 Median : 600   Median :333.0   Median :12.00   Median : 9.000   Median :1226  
 Mean   : 695   Mean   :342.1   Mean   :11.52   Mean   : 9.249   Mean   :1218  
 3rd Qu.: 849   3rd Qu.:434.0   3rd Qu.:12.00   3rd Qu.:13.000   3rd Qu.:1430  
 Max.   :1920   Max.   :567.0   Max.   :12.00   Max.   :19.000   Max.   :1634  
  fit.cluster
 Min.   :2   
 1st Qu.:2   
 Median :2   
 Mean   :2   
 3rd Qu.:2   
 Max.   :2

    Capacity        Occupancy         Month            Day        
 Min.   : 220.0   Min.   :  0.0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 317.0   1st Qu.: 62.0   1st Qu.:11.00   1st Qu.: 6.000  
 Median : 470.0   Median :103.0   Median :11.00   Median :10.000  
 Mean   : 580.5   Mean   :100.4   Mean   :11.48   Mean   : 9.572  
 3rd Qu.: 788.0   3rd Qu.:143.0   3rd Qu.:12.00   3rd Qu.:13.000  
 Max.   :1268.0   Max.   :174.0   Max.   :12.00   Max.   :19.000  
      Time       fit.cluster
 Min.   : 732   Min.   :2   
 1st Qu.: 850   1st Qu.:2   
 Median :1026   Median :2   
 Mean   :1099   Mean   :2   
 3rd Qu.:1327   3rd Qu.:2   
 Max.   :1634   Max.   :2

    Capacity      Occupancy          Month            Day        
 Min.   : 220   Min.   :   2.0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 485   1st Qu.:  84.0   1st Qu.:11.00   1st Qu.: 5.000  
 Median : 690   Median : 188.0   Median :12.00   Median : 9.000  
 Mean   : 785   Mean   : 245.6   Mean   :11.56   Mean   : 9.312  
 3rd Qu.:1010   3rd Qu.: 359.8   3rd Qu.:12.00   3rd Qu.:13.000  
 Max.   :1920   Max.   :1329.0   Max.   :12.00   Max.   :19.000  
      Time        fit.cluster
 Min.   :732.0   Min.   :2   
 1st Qu.:826.0   1st Qu.:2   
 Median :859.0   Median :2   
 Mean   :867.2   Mean   :2   
 3rd Qu.:927.0   3rd Qu.:2   
 Max.   :959.0   Max.   :2

    Capacity        Occupancy          Month            Day        
 Min.   : 220.0   Min.   :   1.0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 485.0   1st Qu.: 215.0   1st Qu.:11.00   1st Qu.: 5.000  
 Median : 690.0   Median : 396.0   Median :12.00   Median : 9.000  
 Mean   : 766.1   Mean   : 451.2   Mean   :11.53   Mean   : 9.232  
 3rd Qu.:1010.0   3rd Qu.: 647.0   3rd Qu.:12.00   3rd Qu.:13.000  
 Max.   :1920.0   Max.   :1586.0   Max.   :12.00   Max.   :19.000  
      Time       fit.cluster
 Min.   :1002   Min.   :2   
 1st Qu.:1109   1st Qu.:2   
 Median :1226   Median :2   
 Mean   :1215   Mean   :2   
 3rd Qu.:1327   3rd Qu.:2   
 Max.   :1427   Max.   :2

    Capacity        Occupancy        Month            Day        
 Min.   : 220.0   Min.   :   0   Min.   :11.00   Min.   : 1.000  
 1st Qu.: 485.0   1st Qu.: 207   1st Qu.:11.00   1st Qu.: 6.000  
 Median : 690.0   Median : 386   Median :12.00   Median :10.000  
 Mean   : 763.3   Mean   : 430   Mean   :11.53   Mean   : 9.656  
 3rd Qu.:1010.0   3rd Qu.: 602   3rd Qu.:12.00   3rd Qu.:14.000  
 Max.   :1920.0   Max.   :1488   Max.   :12.00   Max.   :19.000  
      Time       fit.cluster
 Min.   :1430   Min.   :2   
 1st Qu.:1500   1st Qu.:2   
 Median :1532   Median :2   
 Mean   :1547   Mean   :2   
 3rd Qu.:1603   3rd Qu.:2   
 Max.   :1634   Max.   :2

Cluster 3

Parking lots with high occupancy (above the third quartile, 2194 stalls) are spread evenly across the days of the period analyzed (median day 15).
Parking lots with intermediate occupancy (above the first quartile, 1102 stalls, and below the third quartile, 2194 stalls) show the same spread across the days of the period analyzed (median day 15).
Parking lots with low occupancy (below the first quartile, 1102 stalls) also show a similar spread across the days of the period analyzed (median day 14).
During the early hours of the day (before the first quartile of the time, 11:01) occupancy is low: the mean is 1351 stalls and 75% of the records (up to the third quartile) are at or below 1573 stalls, which is below the third quartile of this cluster (2194 stalls).
During the intermediate hours of the day (after the first quartile, 11:01, and before the third quartile, 14:59) occupancy is highest: the mean is 1819 stalls and 75% of the records (up to the third quartile) are at or below 2567 stalls.
During the late hours of the day (after the third quartile, 14:59) occupancy is intermediate: the mean is 1617 stalls (median 1308) and 75% of the records (up to the third quartile) are at or below 2182 stalls.

# Days
# Days with high occupancy
test = subset(cluster3, Occupancy > 2194)
summary(test)
boxplot(test$Day,
        main = "High Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Days with intermediate occupancy
test = subset(cluster3, Occupancy > 1102 & Occupancy < 2194)
summary(test)
boxplot(test$Day,
        main = "Intermediate Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Days with low occupancy
test = subset(cluster3, Occupancy < 1102)
summary(test)
boxplot(test$Day,
        main = "Low Occupancy",
        xlab = "Time",
        ylab = "Days of the Month",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Hours
# Early hours
test = subset(cluster3, Time < 1101)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the early hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Intermediate hours
test = subset(cluster3, Time > 1101 & Time < 1459)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy at intermediate hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

# Late hours
test = subset(cluster3, Time > 1459)
summary(test)
boxplot(test$Occupancy,
        main = "Occupancy in the late hours",
        xlab = "Cars",
        ylab = "Stalls",
        col = "orange",
        border = "brown",
        horizontal = FALSE,
        notch = TRUE)

    Capacity      Occupancy        Month         Day             Time     
 Min.   :3053   Min.   :2195   Min.   :10   Min.   : 1.00   Min.   : 856  
 1st Qu.:3883   1st Qu.:2556   1st Qu.:10   1st Qu.: 8.00   1st Qu.:1200  
 Median :3883   Median :2834   Median :11   Median :15.00   Median :1328  
 Mean   :4105   Mean   :2882   Mean   :11   Mean   :15.28   Mean   :1318  
 3rd Qu.:4675   3rd Qu.:3185   3rd Qu.:12   3rd Qu.:22.00   3rd Qu.:1459  
 Max.   :4675   Max.   :4327   Max.   :12   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

    Capacity      Occupancy        Month            Day             Time     
 Min.   :1920   Min.   :1103   Min.   :10.00   Min.   : 1.00   Min.   : 755  
 1st Qu.:2009   1st Qu.:1229   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1127  
 Median :3053   Median :1363   Median :11.00   Median :15.00   Median :1304  
 Mean   :2979   Mean   :1436   Mean   :10.86   Mean   :15.38   Mean   :1288  
 3rd Qu.:3103   3rd Qu.:1551   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1457  
 Max.   :4675   Max.   :2193   Max.   :12.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

    Capacity      Occupancy          Month            Day             Time     
 Min.   :1920   Min.   : 385.0   Min.   :10.00   Min.   : 1.00   Min.   : 755  
 1st Qu.:2937   1st Qu.: 727.0   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.: 930  
 Median :3103   Median : 904.0   Median :11.00   Median :14.00   Median :1159  
 Mean   :3284   Mean   : 865.7   Mean   :10.79   Mean   :14.83   Mean   :1202  
 3rd Qu.:3103   3rd Qu.:1016.0   3rd Qu.:11.00   3rd Qu.:20.00   3rd Qu.:1501  
 Max.   :4675   Max.   :1101.0   Max.   :12.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

    Capacity      Occupancy        Month            Day             Time       
 Min.   :1920   Min.   : 467   Min.   :10.00   Min.   : 1.00   Min.   : 755.0  
 1st Qu.:3053   1st Qu.: 902   1st Qu.:10.00   1st Qu.: 8.00   1st Qu.: 900.0  
 Median :3883   Median :1140   Median :11.00   Median :14.00   Median : 957.0  
 Mean   :3687   Mean   :1351   Mean   :10.86   Mean   :14.83   Mean   : 949.2  
 3rd Qu.:4675   3rd Qu.:1573   3rd Qu.:11.00   3rd Qu.:21.00   3rd Qu.:1029.0  
 Max.   :4675   Max.   :3384   Max.   :12.00   Max.   :31.00   Max.   :1100.0  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

    Capacity      Occupancy        Month            Day             Time     
 Min.   :1920   Min.   : 474   Min.   :10.00   Min.   : 1.00   Min.   :1102  
 1st Qu.:2937   1st Qu.:1225   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1203  
 Median :3053   Median :1444   Median :11.00   Median :15.00   Median :1303  
 Mean   :3227   Mean   :1819   Mean   :10.89   Mean   :15.19   Mean   :1289  
 3rd Qu.:3883   3rd Qu.:2567   3rd Qu.:11.00   3rd Qu.:21.00   3rd Qu.:1400  
 Max.   :4675   Max.   :4270   Max.   :12.00   Max.   :31.00   Max.   :1458  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3

    Capacity      Occupancy        Month            Day             Time     
 Min.   :1920   Min.   : 385   Min.   :10.00   Min.   : 1.00   Min.   :1500  
 1st Qu.:2937   1st Qu.:1076   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1527  
 Median :3053   Median :1308   Median :11.00   Median :15.00   Median :1557  
 Mean   :3211   Mean   :1617   Mean   :10.87   Mean   :15.35   Mean   :1566  
 3rd Qu.:3883   3rd Qu.:2182   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1625  
 Max.   :4675   Max.   :4327   Max.   :12.00   Max.   :31.00   Max.   :1634  
  fit.cluster
 Min.   :3   
 1st Qu.:3   
 Median :3   
 Mean   :3   
 3rd Qu.:3   
 Max.   :3
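
The cut-offs used in the subsets above (1102 and 2194 stalls, 11:01 and 14:59) are the first and third quartiles of the cluster, typed in by hand. As a minimal sketch, they could instead be derived from the data with `quantile()` and applied with `cut()`. The data frame `toy` and the helper `band_by_quartiles` are hypothetical and only stand in for `cluster3`; the values are made up.

```r
# Hypothetical stand-in for cluster3: made-up values, for illustration only.
toy <- data.frame(
  Occupancy = c(400, 900, 1200, 1500, 1800, 2300, 2900, 3500),
  Time      = c(800, 930, 1030, 1130, 1300, 1400, 1530, 1630)
)

# Split a numeric vector into three bands using its own first and third
# quartiles as cut-offs, instead of hard-coded thresholds.
band_by_quartiles <- function(x, labels) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  cut(x,
      breaks = c(-Inf, q[1], q[2], Inf),
      labels = labels,
      right = FALSE)  # intervals are [lower, upper): each value gets one band
}

toy$OccupancyBand <- band_by_quartiles(toy$Occupancy,
                                       c("low", "intermediate", "high"))
toy$TimeBand      <- band_by_quartiles(toy$Time,
                                       c("early", "intermediate", "late"))

# Number of records per occupancy band.
print(table(toy$OccupancyBand))

# Mean occupancy per time band, the kind of figure quoted in the analysis.
mean_by_band <- tapply(toy$Occupancy, toy$TimeBand, mean)
print(mean_by_band)
```

One difference from the strict `<` and `>` comparisons in the subsets above: there, a record whose value equals a threshold exactly falls into none of the three groups, while `cut()` with `right = FALSE` assigns every record to exactly one band.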

General Conclusions

There is no direct correlation between the occupancy level and the day of the month, but there is a correlation between the occupancy level and the time of day.

The highest occupancy values occur in intermediate hours (between 10:00 and 14:59 approximately), while the lowest occupancy values are recorded in the early hours of the day (before 10:00).
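
The two conclusions above were drawn by comparing summaries of subsets. As a complementary check (not the method used in this study), the sketch below computes rank correlations with `cor()` on a hypothetical data frame `toy` of made-up values. Because occupancy rises towards midday and then falls, a correlation coefficient with time of day can be close to zero even when the pattern is strong, so the sketch also computes the share of variance explained by the time slot (eta squared).

```r
# Hypothetical toy records (made-up numbers, not the study's data):
# occupancy peaks around midday and does not depend on the day.
toy <- data.frame(
  Day       = rep(c(1, 8, 15, 22), each = 5),
  Time      = rep(c(800, 1000, 1200, 1400, 1600), times = 4),
  Occupancy = rep(c(300, 700, 900, 700, 300), times = 4)
)

# Rank correlation between occupancy and day of the month.
cor_day <- cor(toy$Occupancy, toy$Day, method = "spearman")

# Rank correlation with time of day: the rise-then-fall shape cancels out,
# so the coefficient is close to zero despite the clear pattern.
cor_time <- cor(toy$Occupancy, toy$Time, method = "spearman")

# Share of the occupancy variance explained by the time slot (eta squared),
# which does capture a non-monotonic pattern.
slot_means  <- ave(toy$Occupancy, toy$Time)   # mean occupancy per time slot
grand_mean  <- mean(toy$Occupancy)
eta_sq_time <- sum((slot_means - grand_mean)^2) /
               sum((toy$Occupancy - grand_mean)^2)

print(c(cor_day = cor_day, cor_time = cor_time, eta_sq_time = eta_sq_time))
```

On real data the same three quantities would be computed on the `Occupancy`, `Day` and `Time` columns of each cluster.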

We can also observe a slight increase in occupancy levels from the second half of November to December compared to the period from October to the first half of November.
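
A comparison like this one can be sketched by labeling each record with its period and summarizing occupancy per period. The data frame `toy` is hypothetical, its values are made up only to show the mechanics, and taking day 16 as the start of the second half of November is an assumption.

```r
# Hypothetical toy records (made-up numbers, not the study's data).
toy <- data.frame(
  Month     = c(10, 10, 11, 11, 11, 11, 12, 12),
  Day       = c( 5, 20,  3, 14, 16, 28,  2, 19),
  Occupancy = c(500, 540, 520, 560, 600, 640, 620, 660)
)

# Assumption: the "second half of November" starts on day 16.
toy$Period <- ifelse(toy$Month == 12 | (toy$Month == 11 & toy$Day >= 16),
                     "mid-Nov to Dec", "Oct to mid-Nov")

# Median occupancy in each period.
median_by_period <- tapply(toy$Occupancy, toy$Period, median)
print(median_by_period)
```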

Parking Lot Occupancy K-means Analysis in National Parks in the United States

Sergio Alves