Parking Lot Occupancy K-means Analysis in National Parks in the United States
Introduction
The objective of this study is to determine how the occupancy levels of park parking lots in the United States relate to the months, days and hours of the analyzed period, by applying a k-means algorithm. To achieve this, I’ll use the R programming language and a dataset of 35,717 records with data obtained from October to December 2016.
K-means is a grouping or clustering method. The term “k-means” was first used by James MacQueen in 1967, although the idea goes back to Hugo Steinhaus in 1957.
The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse code modulation, although it was not published outside of Bell Labs until 1982.
Case Study
The general objective of this research is to determine how the occupancy levels of park parking lots in the United States relate to the months, days and hours of the period analyzed, through the development of a k-means algorithm.
The specific objectives are as follows:
- Establish an exploratory analysis of the data through calculations of means, medians and other measures of central tendency.
- Develop and implement the K-means algorithm.
- Determine the correlation that exists in the occupancy levels with respect to the months, days and hours of the period analyzed.
Methodology
Clustering is a technique for finding and classifying K groups of data (clusters). Thus, elements that share similar characteristics will be together in the same group, separated from other groups with which they do not share characteristics.
To find out if the data are similar or different, the K-means algorithm uses the distance between the data. Observations that are similar will have a smaller distance between them. In general, the Euclidean distance is used as a measure, although other functions can also be used.
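To make the distance idea concrete, here is a minimal sketch that computes the Euclidean distance between two observations by hand and with base R's `dist()`. The two observations are hypothetical values chosen for illustration, not rows taken from the dataset.

```r
# Two hypothetical observations described by (Capacity, Occupancy)
a <- c(Capacity = 500, Occupancy = 210)
b <- c(Capacity = 849, Occupancy = 446)

# Euclidean distance computed by hand: square root of the summed squared differences
d_manual <- sqrt(sum((a - b)^2))

# Same distance via the base R dist() function
d_dist <- as.numeric(dist(rbind(a, b), method = "euclidean"))

print(c(manual = d_manual, dist = d_dist))
```

Both values should agree; the smaller the distance, the more similar the two observations are considered to be.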
Clustering algorithms are unsupervised learning algorithms: they look for patterns in the data without having a specific prediction as a target (there is no dependent variable). The only input is the set of variables describing each observation; there is no labeled output.
The K-means algorithm needs as input the number k of groups into which we are going to segment the population. The algorithm first places k random points (centroids). It then assigns each sample to the centroid nearest to it. Next, each centroid is shifted to the mean of the samples assigned to it.
This will generate a new sample assignment, as some samples are now closer to another centroid. This process is repeated iteratively and the groups are adjusted until the allocation no longer changes when the centroids move. The final result is a locally optimal partition that minimizes the within-group sum of squared distances, which is equivalent to maximizing the separation between the groups.
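The iteration described above can be sketched in a few lines of R. This is only an illustrative sketch under simple assumptions (random data points as initial centroids, squared Euclidean distance); the function name `simple_kmeans` and the toy data are hypothetical, and the analysis itself relies on the built-in `kmeans` function.

```r
# Illustrative sketch of the k-means iteration (not the implementation used later)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k random observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")  # index of nearest centroid
    # Stop when the assignment no longer changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Toy data: two groups of 20 points each (illustrative values only)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
res <- simple_kmeans(toy, k = 2)
table(res$cluster)
```

Because the starting centroids are random, different seeds can lead to different local optima, which is one reason a seed is fixed in the notebook.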
This type of unsupervised learning algorithm is useful for exploring, describing and summarizing data in a different way. Using this data clustering can help us to confirm (or reject) some kind of previous classification. It can also help us to discover patterns and relationships we were unaware of.
The dataset to be analyzed consists of the following characteristics:
- 35,717 tuples or rows.
- Four columns with the following data:
- SystemCode: Indicates the code associated with the parking lot.
- Capacity: Indicates the total number of stalls available in the parking lot.
- Occupancy: Indicates the number of parking spaces that are occupied at the time the report is issued.
- LastUpdated: Indicates the exact date on which the tuple data was recorded.
In order to manage the data more efficiently when implementing the k-means algorithm, the LastUpdated field was divided into four fields:
- Year: Year of tuple data record.
- Month: Month of the tuple data record.
- Day: Day of the tuple data record.
- Time: Time of the tuple data record.
Exploratory Analysis
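As a preparatory step, the split of LastUpdated into Year, Month, Day and Time could look roughly like the sketch below. The timestamp format (`%Y-%m-%d %H:%M:%S`) and the sample values are assumptions made for illustration, since the raw file is not shown; the Time encoding as an HHMM number (for example 730 for 7:30) is inferred from the summary values reported later.

```r
# Hypothetical raw values; the real LastUpdated format is an assumption here
raw <- data.frame(LastUpdated = c("2016-10-04 07:59:42", "2016-12-19 16:30:35"),
                  stringsAsFactors = FALSE)

# Parse the timestamp, then extract each component as a number
ts <- as.POSIXct(raw$LastUpdated, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
raw$Year  <- as.integer(format(ts, "%Y"))
raw$Month <- as.integer(format(ts, "%m"))
raw$Day   <- as.integer(format(ts, "%d"))
raw$Time  <- as.integer(format(ts, "%H%M"))  # hour and minute as an HHMM number

raw
```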
Let’s import the cluster and fpc libraries, and create the dataset.
In [48]:
library(cluster)
library(fpc)
set.seed(500)
dataset <- read.csv("../input/dataset/parking-clean-data.csv")
The minimum value, first quartile, median, mean, third quartile and maximum value are obtained for each of the five columns to be analyzed, and then box plots are generated with this information.
We can observe that the parking lots analyzed have a capacity that varies from 220 stalls to 4675 stalls, with a mean of 1398 stalls. Their occupancy varies from 0 stalls to 4327 stalls, with an average of 642.2 stalls.
Regarding the period analyzed, it ranges from month 10 to month 12 of 2016, with an average of 10.88. Regarding the days analyzed, a complete run is made from days 1 to 31, with an average of 15.16. The analyzed hours range from 7:30 to 16:34, with a mean of 12:05.
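One caveat worth illustrating: Time appears to be stored as an HHMM number, so averaging it directly is only an approximation of the mean clock time. The sketch below uses two hypothetical helper functions, `hhmm_to_minutes` and `minutes_to_clock`, to show the difference on the earliest and latest values reported in the summary.

```r
# Convert an HHMM number (e.g. 730 means 7:30) to minutes since midnight
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + (t %% 100)

# Format minutes since midnight as a clock time
minutes_to_clock <- function(m) {
  sprintf("%02d:%02d", as.integer(m %/% 60), as.integer(m %% 60))
}

times <- c(730, 1634)   # earliest and latest Time values in the summary

naive_mean <- mean(times)                                     # 1182, not a valid clock time
clock_mean <- minutes_to_clock(mean(hhmm_to_minutes(times)))  # "12:02"

print(naive_mean)
print(clock_mean)
```

For the full dataset the two approaches should give similar results, so the reported mean of 12:05 is best read as approximate.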
The Capacity and Occupancy fields have outliers that may affect the effectiveness of the clustering process. However, they are not adjusted at this stage because of their high importance in this analysis. After the clustering process, outliers in the created clusters will be removed in order to generate more accurate conclusions.
In [49]:
# Cast columns to numeric
dataset$Capacity <- as.numeric(dataset$Capacity)
dataset$Occupancy <- as.numeric(dataset$Occupancy)
dataset$Month <- as.numeric(dataset$Month)
dataset$Day <- as.numeric(dataset$Day)
dataset$Time <- as.numeric(dataset$Time)
In [50]:
# Drop rows that contain NA values
dataset <- na.omit(dataset)
# Create new dataset with the columns that will be used
mydata <- dataset[, c(2, 3, 5:7)]
# Get a summary of the dataset
summary(mydata)
boxplot(mydata)
Statistic | Capacity | Occupancy | Month | Day | Time |
---|---|---|---|---|---|
Min. | 220 | 0.0 | 10.00 | 1.00 | 730 |
1st Qu. | 500 | 210.0 | 10.00 | 8.00 | 1000 |
Median | 849 | 446.0 | 11.00 | 15.00 | 1204 |
Mean | 1398 | 642.2 | 10.88 | 15.16 | 1205 |
3rd Qu. | 2009 | 798.0 | 11.00 | 22.00 | 1429 |
Max. | 4675 | 4327.0 | 12.00 | 31.00 | 1634 |
In [51]:
# Capacity boxplot
boxplot(mydata$Capacity, main = "Capacity", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
In [52]:
# Occupancy boxplot
boxplot(mydata$Occupancy, main = "Occupancy", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
In [53]:
# Month boxplot
boxplot(mydata$Month, main = "Month", xlab = "Time", ylab = "Months",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
Warning message in bxp(list(stats = structure(c(10, 10, 11, 11, 12), .Dim = c(5L, : “some notches went outside hinges ('box'): maybe set notch=FALSE”
In [54]:
# Day boxplot
boxplot(mydata$Day, main = "Day", xlab = "Time", ylab = "Days of the month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
In [55]:
# Time boxplot
boxplot(mydata$Time, main = "Hour", xlab = "Time", ylab = "Hours of the day",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
K-means Algorithm
The variables in the dataset are at different scales, so it is necessary to scale the dataset to maintain uniformity as a step prior to running the K-means algorithm.
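To see why scaling matters, consider a small sketch with two columns on very different scales. The values are illustrative only. Without scaling, a variable measured in thousands (such as Capacity) would dominate the Euclidean distance over one that only ranges from 10 to 12 (such as Month).

```r
# Toy columns on very different scales (illustrative values, not the dataset)
toy <- data.frame(Capacity = c(220, 849, 2009, 4675),
                  Month    = c(10, 11, 11, 12))

# scale() centres each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(toy)

col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

print(round(col_means, 10))
print(col_sds)
```

After scaling, every column contributes to the distance on a comparable footing.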
To choose the number of clusters (k), the elbow method is used. The total sum of squares of the scaled dataset is stored as the first element of an array (the k = 1 case), and k-means is then run for k = 2 to 15. At each iteration, the total within-cluster sum of squares (the sum of withinss) is recorded. The results are then plotted and the k value at which the slope of the curve shows the greatest change is selected.
In [56]:
# The dataset variables are at different scales.
# To maintain uniformity, the columns are scaled
scaled_data <- scale(mydata[, 1:5])
# Total sum of squares, i.e. the within-cluster sum of squares for k = 1
wss <- (nrow(scaled_data) - 1) * sum(apply(scaled_data, 2, var))
# For k = 2 to 15, record the total within-cluster sum of squares (withinss)
for (i in 2:15) {
  wss[i] <- sum(kmeans(scaled_data, centers = i, iter.max = 15)$withinss)
}
# Plot the value recorded for each k
plot(1:15, wss, type = "b", main = "15 clusters", xlab = "# of clusters", ylab = "sum of squares")
From this graph, we can observe that the slope has a more significant change when the number of clusters is equal to 3, which is why this value will be taken as the number of clusters to be used.
After applying the kmeans function, properties of the generated clusters are evaluated:
- Centers: matrix indicating the mean of each (scaled) variable for each cluster.
- Totss (Total sum of squares).
- Withinss (Intra-cluster sum of squares, one value per cluster).
- Tot.withinss (Total sum of intra-cluster squares).
- Betweenss (Sum of inter-cluster squares).
- Size: Number of points in each cluster.
In [57]:
# The slope changes most at k = 3,
# so 3 is considered as the optimal number of clusters.
fit <- kmeans(scaled_data, 3)
fit$centers
fit$totss
fit$withinss
fit$tot.withinss
fit$betweenss
fit$size
Cluster | Capacity | Occupancy | Month | Day | Time |
---|---|---|---|---|---|
1 | -0.4041041 | -0.3906482 | -0.679259247 | 0.556198370 | -0.07882166 |
2 | -0.3737569 | -0.3354071 | 0.870854168 | -0.714267704 | -0.04301146 |
3 | 1.6444271 | 1.5418053 | -0.005875559 | 0.007001652 | 0.26551874 |
- totss: 178580
- withinss: 44049.45510661, 30017.6406344365, 32196.509518354
- tot.withinss: 106263.605259401
- betweenss: 72316.3947405994
- size: 16187, 12672, 6858
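The within-cluster sums of squares are 44049.46, 30017.64 and 32196.51 (106263.61 in total), while the between-cluster sum of squares is 72316.39. These are sums of squared distances on the scaled data rather than distances themselves.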
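As a quick sanity check on these figures, the total sum of squares should equal the within-cluster total plus the between-cluster sum of squares. The sketch below redoes that arithmetic with the values printed by the notebook; the variable names are chosen here for illustration.

```r
# Values copied from the kmeans output above
totss     <- 178580
withinss  <- c(44049.45510661, 30017.6406344365, 32196.509518354)
betweenss <- 72316.3947405994

# Sum of the per-cluster values; should match fit$tot.withinss
tot_withinss <- sum(withinss)

# Decomposition check: totss = tot.withinss + betweenss
decomposition_ok <- isTRUE(all.equal(tot_withinss + betweenss, totss))

# Share of the total sum of squares that lies between clusters, in percent
between_share <- round(100 * betweenss / totss, 1)

print(decomposition_ok)
print(between_share)
```

The between-cluster share gives a rough sense of how much of the overall variation the three clusters capture.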
Through the plotcluster function, a graph with the clustering result is obtained, and subsequently a matrix with the means of each cluster in each dimension is obtained to evaluate how different the clusters obtained are.
The first cluster has a size of 16187 points (red in the graph), the second is 12672 points (black in the graph) and the third is 6858 points (green in the graph).
From the graph obtained, it can be deduced that there are three large differentiated groups into which the data can be categorized. The groups are explained in detail below.
In [58]:
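# Interpret patterns
plotcluster(scaled_data, fit$cluster)
# Overlay the cluster centers. Note: fit$centers is expressed in scaled-variable
# space, while plotcluster projects onto discriminant coordinates by default,
# so this overlay is only a rough guide.
points(fit$centers, col = 1:3, pch = 16)
# Mean of each variable in each cluster
mean_data <- dataset[, c(2, 3, 5:7)]
mean_data <- data.frame(mean_data, fit$cluster)
mean_ds <- aggregate(mean_data[, 1:5], by = list(fit$cluster), FUN = mean)
mean_ds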
Group.1 | Capacity | Occupancy | Month | Day | Time |
---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | 920.9793 | 385.5916 | 10.37036 | 19.773522 | 1184.315 |
2 | 956.7685 | 421.8825 | 11.53614 | 9.239189 | 1193.640 |
3 | 3336.8672 | 1655.1260 | 10.87679 | 15.219743 | 1273.979 |
The results of the mean matrix are as follows:
- Group 1:
- Capacity: 920.9793
- Occupancy: 385.5916
- Month: 10.37036
- Day: 19.773522
- Time: 1184.315
- Group 2:
- Capacity: 956.7685
- Occupancy: 421.8825
- Month: 11.53614
- Day: 9.239189
- Time: 1193.640
- Group 3:
- Capacity: 3336.8672
- Occupancy: 1655.1260
- Month: 10.87679
- Day: 15.219743
- Time: 1273.979
These data can be synthesized as follows:
- Group 1: is composed of low and medium capacity parking lots, mainly during the month of October and part of the month of November. It consists of 16187 records (45.32%).
- Group 2: is composed of low and medium capacity parking lots during the months of November and December. It consists of 12672 records (35.48%).
- Group 3: is composed of high capacity parking lots during the months of October, November and December. It consists of 6858 records (19.20%).
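The percentages quoted for each group follow directly from the cluster sizes. A minimal sketch of that arithmetic, using the sizes reported by `fit$size`:

```r
# Cluster sizes reported by the kmeans fit
sizes <- c(group1 = 16187, group2 = 12672, group3 = 6858)

# Share of all records in each cluster, in percent
shares <- round(100 * sizes / sum(sizes), 2)

print(sum(sizes))
print(shares)
```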
In [59]:
# Assign a variable for each cluster
cluster1 <- subset(mean_data, fit.cluster == 1)
cluster2 <- subset(mean_data, fit.cluster == 2)
cluster3 <- subset(mean_data, fit.cluster == 3)
# Summary of cluster 1
summary(cluster1)
Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220 | 0.0 | 10.00 | 4.00 | 730 | 1 |
1st Qu. | 485 | 175.0 | 10.00 | 15.00 | 932 | 1 |
Median | 720 | 334.0 | 10.00 | 21.00 | 1200 | 1 |
Mean | 921 | 385.6 | 10.37 | 19.77 | 1184 | 1 |
3rd Qu. | 1194 | 560.0 | 11.00 | 26.00 | 1426 | 1 |
Max. | 3883 | 1451.0 | 11.00 | 31.00 | 1634 | 1 |
In [60]:
boxplot(cluster1)
In [61]:
summary(cluster2)
Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 11.00 | 1.000 | 732 | 2 |
1st Qu. | 485.0 | 187.0 | 11.00 | 5.000 | 956 | 2 |
Median | 720.0 | 364.5 | 12.00 | 9.000 | 1200 | 2 |
Mean | 956.8 | 421.9 | 11.54 | 9.239 | 1194 | 2 |
3rd Qu. | 1200.0 | 605.0 | 12.00 | 13.000 | 1426 | 2 |
Max. | 3103.0 | 1618.0 | 12.00 | 19.000 | 1634 | 2 |
In [62]:
boxplot(cluster2)
In order to facilitate the analysis process, outliers in the Capacity dimension will be eliminated for clusters 1 and 2. Therefore, only those tuples whose Capacity is less than 2000 stalls will be retained. This eliminates 1372 records from cluster 1 and 1263 records from cluster 2, for a total of 2635 records (7.38% of the total data).
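The sketch below redoes the arithmetic for the share of records removed and, for comparison, computes the upper limit that the common 1.5 × IQR box plot rule would give from the Capacity quartiles reported for clusters 1 and 2. The fixed 2000-stall cutoff is the choice made in this analysis; the IQR rule and the helper name `upper_fence` are shown only as a point of reference, not as the method used here.

```r
# Records removed from each cluster, as reported above
removed <- c(cluster1 = 1372, cluster2 = 1263)
total_records <- 35717

# Share of all records removed, in percent
removed_share <- round(100 * sum(removed) / total_records, 2)

# Upper limit under the usual 1.5 * IQR rule, from first and third quartiles
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

fence_c1 <- upper_fence(485, 1194)   # cluster 1 Capacity quartiles
fence_c2 <- upper_fence(485, 1200)   # cluster 2 Capacity quartiles

print(removed_share)
print(c(cluster1 = fence_c1, cluster2 = fence_c2))
```

Under these quartiles the IQR rule would place the limit somewhat above 2000 stalls, so the fixed cutoff is slightly stricter.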
The results of the adjustment are presented below:
In [63]:
# Determine number of outliers in clusters 1 and 2
length(cluster1$Capacity[cluster1$Capacity > 2000]) # 1372
length(cluster2$Capacity[cluster2$Capacity > 2000]) # 1263
# Remove outliers from clusters 1 and 2
cluster1 <- subset(cluster1, Capacity < 2000)
cluster2 <- subset(cluster2, Capacity < 2000)
1372
1263
In [64]:
summary(cluster1)
summary(cluster2)
summary(cluster3)
summary(cluster1):

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 10.00 | 4.00 | 730 | 1 |
1st Qu. | 480.0 | 166.0 | 10.00 | 15.00 | 959 | 1 |
Median | 690.0 | 310.0 | 10.00 | 20.00 | 1203 | 1 |
Mean | 759.6 | 363.5 | 10.38 | 19.66 | 1199 | 1 |
3rd Qu. | 1010.0 | 532.0 | 11.00 | 26.00 | 1429 | 1 |
Max. | 1920.0 | 1412.0 | 11.00 | 31.00 | 1634 | 1 |

summary(cluster2):

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 11.00 | 1.000 | 732 | 2 |
1st Qu. | 485.0 | 175.0 | 11.00 | 5.000 | 1000 | 2 |
Median | 690.0 | 332.0 | 12.00 | 9.000 | 1225 | 2 |
Mean | 770.5 | 392.9 | 11.53 | 9.335 | 1207 | 2 |
3rd Qu. | 1010.0 | 568.0 | 12.00 | 13.000 | 1429 | 2 |
Max. | 1920.0 | 1586.0 | 12.00 | 19.000 | 1634 | 2 |

summary(cluster3):

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 1920 | 385 | 10.00 | 1.00 | 755 | 3 |
1st Qu. | 2937 | 1102 | 10.00 | 9.00 | 1101 | 3 |
Median | 3103 | 1363 | 11.00 | 15.00 | 1303 | 3 |
Mean | 3337 | 1655 | 10.88 | 15.22 | 1274 | 3 |
3rd Qu. | 3883 | 2194 | 11.00 | 22.00 | 1459 | 3 |
Max. | 4675 | 4327 | 12.00 | 31.00 | 1634 | 3 |
In [65]:
# cluster 1
# Capacity boxplot
boxplot(cluster1$Capacity, main = "Capacity", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Occupancy boxplot
boxplot(cluster1$Occupancy, main = "Occupancy", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Month boxplot
boxplot(cluster1$Month, main = "Month", xlab = "Time", ylab = "Months",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Day boxplot
boxplot(cluster1$Day, main = "Day", xlab = "Time", ylab = "Days of month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Time boxplot
boxplot(cluster1$Time, main = "Hour", xlab = "Time", ylab = "Hours of Day",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
Warning message in bxp(list(stats = structure(c(10, 10, 10, 11, 11), .Dim = c(5L, : “some notches went outside hinges ('box'): maybe set notch=FALSE”
In [66]:
# cluster 2
# Capacity boxplot
boxplot(cluster2$Capacity, main = "Capacity", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Occupancy boxplot
boxplot(cluster2$Occupancy, main = "Occupancy", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Month boxplot
boxplot(cluster2$Month, main = "Month", xlab = "Time", ylab = "Months",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Day boxplot
boxplot(cluster2$Day, main = "Day", xlab = "Time", ylab = "Days of month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Time boxplot
boxplot(cluster2$Time, main = "Hour", xlab = "Time", ylab = "Hours of Day",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
Warning message in bxp(list(stats = structure(c(11, 11, 12, 12, 12), .Dim = c(5L, : “some notches went outside hinges ('box'): maybe set notch=FALSE”
In [67]:
# cluster 3
# Capacity boxplot
boxplot(cluster3$Capacity, main = "Capacity", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Occupancy boxplot
boxplot(cluster3$Occupancy, main = "Occupancy", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Month boxplot
boxplot(cluster3$Month, main = "Month", xlab = "Time", ylab = "Months",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Day boxplot
boxplot(cluster3$Day, main = "Day", xlab = "Time", ylab = "Days of month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Time boxplot
boxplot(cluster3$Time, main = "Hour", xlab = "Time", ylab = "Hours of Day",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
Warning message in bxp(list(stats = structure(c(10, 10, 11, 11, 12), .Dim = c(5L, : “some notches went outside hinges ('box'): maybe set notch=FALSE”
Results and Conclusions
Based on the summary of the clusters and the box plots obtained, the following conclusions can be drawn.
General analysis of the clusters
Parking lots with low and medium capacity (between 220 and 1920 stalls, with an average of 759.6) have an average occupancy of 363.5 stalls in cluster 1 (October and part of November), which represents an occupancy of 47.85% of the average capacity. In cluster 2 (November and December) this occupancy increases to 392.9 stalls out of an average capacity of 770.5, which constitutes about 51% occupancy.
High capacity parking lots (between 1920 and 4675 stalls, with an average of 3337) will have an average occupancy of 1655, representing 49.6%.
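The occupancy percentages above are the ratio of each cluster's mean occupancy to its mean capacity. The sketch below reproduces that arithmetic from the reported means; the group labels are descriptive names added here. Note that a ratio of means is not necessarily the same as the mean of per-record occupancy ratios, so these figures are a summary rather than an exact per-lot rate.

```r
# Mean capacity and mean occupancy per cluster, taken from the summaries above
rates <- data.frame(
  group     = c("Cluster 1", "Cluster 2", "Cluster 3"),
  capacity  = c(759.6, 770.5, 3337),
  occupancy = c(363.5, 392.9, 1655)
)

# Mean occupancy as a percentage of mean capacity
rates$pct <- round(100 * rates$occupancy / rates$capacity, 2)

print(rates)
```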
Cluster analysis
Cluster 1
- Parking lots with high occupancy (value above the third quartile, which is 532 stalls) have a similar occupancy during the days of the period analyzed.
- Parking lots with intermediate occupancy (value higher than the first quartile, which is 166 stalls, and lower than the third quartile, which is 532 stalls) have similar occupancy during the days of the period analyzed.
- Parking lots with low occupancy (value lower than the first quartile, which is 166 stalls) are evenly distributed over the days of October and November.
- During the first hours of the day (value below the first quartile, which is 9:59) there is a low occupancy, with a mean of 227.5 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 329 stalls. This is a value below the mean of this cluster (363.5 stalls).
- During intermediate hours of the day (value above the first quartile 9:59 and below the third quartile 14:29) a high occupancy is recorded, with an average of 411.7 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 598 stalls.
- During the last hours of the day (value above the third quartile, which is 14:29) there is an intermediate occupancy, with an average of 405.2 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 567 stalls.
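The three time slices can also be produced in one step with `cut()` and `aggregate()`. The following is a minimal sketch on toy records with illustrative values, using cluster 1's Time quartiles (959 and 1429) as breaks. Unlike the strict inequalities used in the notebook's subsets, `cut()` with its default settings includes each upper boundary in the lower slice, so records exactly at 959 or 1429 are kept rather than dropped.

```r
# Toy records (illustrative values only, not taken from the dataset)
toy <- data.frame(Time      = c(730, 859, 1000, 1204, 1428, 1430, 1634),
                  Occupancy = c(100, 200, 400, 500, 450, 420, 380))

# Bin Time into three slots using cluster 1's first and third Time quartiles
toy$slot <- cut(toy$Time, breaks = c(-Inf, 959, 1429, Inf),
                labels = c("early", "intermediate", "late"))

# Mean occupancy per slot
slot_means <- aggregate(Occupancy ~ slot, data = toy, FUN = mean)
print(slot_means)
```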
In [68]:
# Days
# High occupancy days
test <- subset(cluster1, Occupancy > 532)
summary(test)
boxplot(test$Day, main = "High Occupancy", xlab = "Time", ylab = "Days of the month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Intermediate occupancy days
test <- subset(cluster1, Occupancy < 532 & Occupancy > 166)
summary(test)
boxplot(test$Day, main = "Intermediate Occupancy", xlab = "Time", ylab = "Days of the month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Low occupancy days
test <- subset(cluster1, Occupancy < 166)
summary(test)
boxplot(test$Day, main = "Low Occupancy", xlab = "Time", ylab = "Days of the month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Hours
# Early hours
test <- subset(cluster1, Time < 959)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the early hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Intermediate hours
test <- subset(cluster1, Time > 959 & Time < 1429)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the intermediate hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Late hours
test <- subset(cluster1, Time > 1429)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the last hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
High occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 577 | 533.0 | 10.00 | 4.00 | 755 | 1 |
1st Qu. | 720 | 597.0 | 10.00 | 15.00 | 1130 | 1 |
Median | 1010 | 673.0 | 10.00 | 21.00 | 1304 | 1 |
Mean | 1066 | 722.4 | 10.41 | 19.98 | 1289 | 1 |
3rd Qu. | 1268 | 796.0 | 11.00 | 26.00 | 1434 | 1 |
Max. | 1920 | 1412.0 | 11.00 | 31.00 | 1634 | 1 |

Intermediate occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 167.0 | 10.00 | 4.00 | 732 | 1 |
1st Qu. | 485.0 | 224.0 | 10.00 | 14.00 | 1001 | 1 |
Median | 600.0 | 311.0 | 10.00 | 20.00 | 1226 | 1 |
Mean | 697.6 | 320.5 | 10.37 | 19.42 | 1212 | 1 |
3rd Qu. | 849.0 | 405.0 | 11.00 | 26.00 | 1430 | 1 |
Max. | 1920.0 | 531.0 | 11.00 | 31.00 | 1634 | 1 |

Low occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.00 | 10.00 | 4.00 | 730 | 1 |
1st Qu. | 387.0 | 57.00 | 10.00 | 15.00 | 831 | 1 |
Median | 470.0 | 93.00 | 10.00 | 20.00 | 1001 | 1 |
Mean | 578.5 | 91.86 | 10.36 | 19.84 | 1082 | 1 |
3rd Qu. | 788.0 | 130.00 | 11.00 | 26.00 | 1304 | 1 |
Max. | 1268.0 | 165.00 | 11.00 | 31.00 | 1634 | 1 |

Early hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 10.00 | 4.00 | 730.0 | 1 |
1st Qu. | 480.0 | 78.0 | 10.00 | 15.00 | 804.0 | 1 |
Median | 690.0 | 180.0 | 10.00 | 20.00 | 856.0 | 1 |
Mean | 778.8 | 227.5 | 10.38 | 19.54 | 858.8 | 1 |
3rd Qu. | 1010.0 | 329.0 | 11.00 | 26.00 | 926.0 | 1 |
Max. | 1920.0 | 1262.0 | 11.00 | 31.00 | 958.0 | 1 |

Intermediate hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 10.00 | 4.00 | 1000 | 1 |
1st Qu. | 470.0 | 203.0 | 10.00 | 15.00 | 1101 | 1 |
Median | 690.0 | 358.0 | 10.00 | 20.00 | 1204 | 1 |
Mean | 753.7 | 411.7 | 10.39 | 19.82 | 1200 | 1 |
3rd Qu. | 1010.0 | 598.0 | 11.00 | 26.00 | 1326 | 1 |
Max. | 1920.0 | 1412.0 | 11.00 | 31.00 | 1428 | 1 |

Late hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 10.00 | 4.00 | 1430 | 1 |
1st Qu. | 470.0 | 197.0 | 10.00 | 14.00 | 1501 | 1 |
Median | 690.0 | 368.0 | 10.00 | 20.00 | 1531 | 1 |
Mean | 750.7 | 405.2 | 10.38 | 19.34 | 1539 | 1 |
3rd Qu. | 1010.0 | 567.0 | 11.00 | 26.00 | 1604 | 1 |
Max. | 1920.0 | 1263.0 | 11.00 | 31.00 | 1634 | 1 |
Cluster 2
- Parking lots with high occupancy (value higher than the third quartile of 568 stalls) show similar occupancy during the days of the period analyzed.
- Parking lots with intermediate occupancy (value higher than the first quartile, which is 175 stalls and lower than the third quartile, which is 568 stalls) have similar occupancy during the days of the period analyzed.
- Parking lots with low occupancy (value lower than the first quartile which is 175 stalls) present a similar occupancy during the days of the analyzed period.
- During the first hours of the day (value below the first quartile which is 10:00) there is a low occupancy, with an average of 245.6 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 359.8 stalls, this is a value below the average of this cluster (392.9 stalls).
- During intermediate hours of the day (value above the first quartile 10:00 and below the third quartile 14:29) there is a high occupancy, with a mean of 451.2 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 647 stalls.
- During the last hours of the day (value above the third quartile which is 14:29) there is an intermediate occupancy, with an average of 430 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 602 stalls.
In [69]:
# Days
# Days with high occupancy
test <- subset(cluster2, Occupancy > 568)
summary(test)
boxplot(test$Day, main = "High Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Days with intermediate occupancy
test <- subset(cluster2, Occupancy > 175 & Occupancy < 568)
summary(test)
boxplot(test$Day, main = "Intermediate Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Days with low occupancy
test <- subset(cluster2, Occupancy < 175)
summary(test)
boxplot(test$Day, main = "Low Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Hours
# Early hours
test <- subset(cluster2, Time < 1000)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the early hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Intermediate hours
test <- subset(cluster2, Time > 1000 & Time < 1429)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the intermediate hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Late hours
test <- subset(cluster2, Time > 1429)
summary(test)
boxplot(test$Occupancy, main = "Occupancy at last hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
High occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 600 | 569.0 | 11.0 | 1.000 | 756 | 2 |
1st Qu. | 863 | 641.0 | 11.0 | 5.000 | 1130 | 2 |
Median | 1010 | 728.0 | 12.0 | 9.000 | 1303 | 2 |
Mean | 1113 | 787.7 | 11.6 | 9.244 | 1294 | 2 |
3rd Qu. | 1322 | 860.0 | 12.0 | 14.000 | 1456 | 2 |
Max. | 1920 | 1586.0 | 12.0 | 19.000 | 1634 | 2 |

Intermediate occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220 | 176.0 | 11.00 | 1.000 | 740 | 2 |
1st Qu. | 485 | 238.0 | 11.00 | 5.000 | 1000 | 2 |
Median | 600 | 333.0 | 12.00 | 9.000 | 1226 | 2 |
Mean | 695 | 342.1 | 11.52 | 9.249 | 1218 | 2 |
3rd Qu. | 849 | 434.0 | 12.00 | 13.000 | 1430 | 2 |
Max. | 1920 | 567.0 | 12.00 | 19.000 | 1634 | 2 |

Low occupancy days:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0.0 | 11.00 | 1.000 | 732 | 2 |
1st Qu. | 317.0 | 62.0 | 11.00 | 6.000 | 850 | 2 |
Median | 470.0 | 103.0 | 11.00 | 10.000 | 1026 | 2 |
Mean | 580.5 | 100.4 | 11.48 | 9.572 | 1099 | 2 |
3rd Qu. | 788.0 | 143.0 | 12.00 | 13.000 | 1327 | 2 |
Max. | 1268.0 | 174.0 | 12.00 | 19.000 | 1634 | 2 |

Early hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220 | 2.0 | 11.00 | 1.000 | 732.0 | 2 |
1st Qu. | 485 | 84.0 | 11.00 | 5.000 | 826.0 | 2 |
Median | 690 | 188.0 | 12.00 | 9.000 | 859.0 | 2 |
Mean | 785 | 245.6 | 11.56 | 9.312 | 867.2 | 2 |
3rd Qu. | 1010 | 359.8 | 12.00 | 13.000 | 927.0 | 2 |
Max. | 1920 | 1329.0 | 12.00 | 19.000 | 959.0 | 2 |

Intermediate hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 1.0 | 11.00 | 1.000 | 1002 | 2 |
1st Qu. | 485.0 | 215.0 | 11.00 | 5.000 | 1109 | 2 |
Median | 690.0 | 396.0 | 12.00 | 9.000 | 1226 | 2 |
Mean | 766.1 | 451.2 | 11.53 | 9.232 | 1215 | 2 |
3rd Qu. | 1010.0 | 647.0 | 12.00 | 13.000 | 1327 | 2 |
Max. | 1920.0 | 1586.0 | 12.00 | 19.000 | 1427 | 2 |

Late hours:

Statistic | Capacity | Occupancy | Month | Day | Time | fit.cluster |
---|---|---|---|---|---|---|
Min. | 220.0 | 0 | 11.00 | 1.000 | 1430 | 2 |
1st Qu. | 485.0 | 207 | 11.00 | 6.000 | 1500 | 2 |
Median | 690.0 | 386 | 12.00 | 10.000 | 1532 | 2 |
Mean | 763.3 | 430 | 11.53 | 9.656 | 1547 | 2 |
3rd Qu. | 1010.0 | 602 | 12.00 | 14.000 | 1603 | 2 |
Max. | 1920.0 | 1488 | 12.00 | 19.000 | 1634 | 2 |
Cluster 3
- The parking lots with high occupancy (value higher than the third quartile, which is 2194 stalls) show similar occupancy during the days of the period analyzed.
- Parking lots with intermediate occupancy (value higher than the first quartile, which is 1102 stalls and lower than the third quartile, which is 2194 stalls) show similar occupancy during the days of the period analyzed.
- Parking lots with low occupancy (value lower than the first quartile which is 1102 stalls) present a similar occupancy during the days of the analyzed period.
- During the first hours of the day (value below the first quartile, which is 11:01) there is a low occupancy, with an average of 1351 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 1573 stalls. This is a value below the average of this cluster (1655 stalls).
- During intermediate hours of the day (value above the first quartile 11:01 and below the third quartile 14:59) there is a higher occupancy, with an average of 1819 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 2567 stalls.
- During the last hours of the day (value above the third quartile which is 14:59) there is an intermediate occupancy, with an average of 1308 stalls, and in which 75% of the records (up to the third quartile) reach a maximum of 2182 stalls.
In [70]:
# Days
# Days with high occupancy
test <- subset(cluster3, Occupancy > 2194)
summary(test)
boxplot(test$Day, main = "High Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Days with intermediate occupancy
test <- subset(cluster3, Occupancy > 1102 & Occupancy < 2194)
summary(test)
boxplot(test$Day, main = "Intermediate Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Days with low occupancy
test <- subset(cluster3, Occupancy < 1102)
summary(test)
boxplot(test$Day, main = "Low Occupancy", xlab = "Time", ylab = "Days of the Month",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Hours
# Early hours
test <- subset(cluster3, Time < 1101)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the early hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Intermediate hours
test <- subset(cluster3, Time > 1101 & Time < 1459)
summary(test)
boxplot(test$Occupancy, main = "Occupancy at intermediate hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
# Late hours
test <- subset(cluster3, Time > 1459)
summary(test)
boxplot(test$Occupancy, main = "Occupancy in the last hours", xlab = "Cars", ylab = "Stalls",
        col = "orange", border = "brown", horizontal = FALSE, notch = TRUE)
    Capacity      Occupancy        Month         Day             Time
 Min.   :3053   Min.   :2195   Min.   :10   Min.   : 1.00   Min.   : 856
 1st Qu.:3883   1st Qu.:2556   1st Qu.:10   1st Qu.: 8.00   1st Qu.:1200
 Median :3883   Median :2834   Median :11   Median :15.00   Median :1328
 Mean   :4105   Mean   :2882   Mean   :11   Mean   :15.28   Mean   :1318
 3rd Qu.:4675   3rd Qu.:3185   3rd Qu.:12   3rd Qu.:22.00   3rd Qu.:1459
 Max.   :4675   Max.   :4327   Max.   :12   Max.   :31.00   Max.   :1634
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
    Capacity      Occupancy        Month            Day             Time
 Min.   :1920   Min.   :1103   Min.   :10.00   Min.   : 1.00   Min.   : 755
 1st Qu.:2009   1st Qu.:1229   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1127
 Median :3053   Median :1363   Median :11.00   Median :15.00   Median :1304
 Mean   :2979   Mean   :1436   Mean   :10.86   Mean   :15.38   Mean   :1288
 3rd Qu.:3103   3rd Qu.:1551   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1457
 Max.   :4675   Max.   :2193   Max.   :12.00   Max.   :31.00   Max.   :1634
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
    Capacity      Occupancy          Month            Day             Time
 Min.   :1920   Min.   : 385.0   Min.   :10.00   Min.   : 1.00   Min.   : 755
 1st Qu.:2937   1st Qu.: 727.0   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.: 930
 Median :3103   Median : 904.0   Median :11.00   Median :14.00   Median :1159
 Mean   :3284   Mean   : 865.7   Mean   :10.79   Mean   :14.83   Mean   :1202
 3rd Qu.:3103   3rd Qu.:1016.0   3rd Qu.:11.00   3rd Qu.:20.00   3rd Qu.:1501
 Max.   :4675   Max.   :1101.0   Max.   :12.00   Max.   :31.00   Max.   :1634
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
    Capacity      Occupancy        Month            Day             Time
 Min.   :1920   Min.   : 467   Min.   :10.00   Min.   : 1.00   Min.   : 755.0
 1st Qu.:3053   1st Qu.: 902   1st Qu.:10.00   1st Qu.: 8.00   1st Qu.: 900.0
 Median :3883   Median :1140   Median :11.00   Median :14.00   Median : 957.0
 Mean   :3687   Mean   :1351   Mean   :10.86   Mean   :14.83   Mean   : 949.2
 3rd Qu.:4675   3rd Qu.:1573   3rd Qu.:11.00   3rd Qu.:21.00   3rd Qu.:1029.0
 Max.   :4675   Max.   :3384   Max.   :12.00   Max.   :31.00   Max.   :1100.0
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
    Capacity      Occupancy        Month            Day             Time
 Min.   :1920   Min.   : 474   Min.   :10.00   Min.   : 1.00   Min.   :1102
 1st Qu.:2937   1st Qu.:1225   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1203
 Median :3053   Median :1444   Median :11.00   Median :15.00   Median :1303
 Mean   :3227   Mean   :1819   Mean   :10.89   Mean   :15.19   Mean   :1289
 3rd Qu.:3883   3rd Qu.:2567   3rd Qu.:11.00   3rd Qu.:21.00   3rd Qu.:1400
 Max.   :4675   Max.   :4270   Max.   :12.00   Max.   :31.00   Max.   :1458
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
    Capacity      Occupancy        Month            Day             Time
 Min.   :1920   Min.   : 385   Min.   :10.00   Min.   : 1.00   Min.   :1500
 1st Qu.:2937   1st Qu.:1076   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:1527
 Median :3053   Median :1308   Median :11.00   Median :15.00   Median :1557
 Mean   :3211   Mean   :1617   Mean   :10.87   Mean   :15.35   Mean   :1566
 3rd Qu.:3883   3rd Qu.:2182   3rd Qu.:11.00   3rd Qu.:22.00   3rd Qu.:1625
 Max.   :4675   Max.   :4327   Max.   :12.00   Max.   :31.00   Max.   :1634
  fit.cluster
 Min.   :3
 1st Qu.:3
 Median :3
 Mean   :3
 3rd Qu.:3
 Max.   :3
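The three time bands summarized above could also be built in a single pass with base R's cut(), which assigns every record to exactly one band. The sketch below is only an illustration under stated assumptions: the data frame demo and the names Band and band_stats are hypothetical, the values are synthetic rather than the study data, and the band edges (1101 and 1459) are the quartiles quoted in the text. Unlike the strict comparisons in the cell above, this variant keeps records that fall exactly on an edge.

```r
# Hypothetical stand-in for cluster3: synthetic records, NOT the study data.
set.seed(1)
demo <- data.frame(
  Time      = rep(c(800, 930, 1101, 1200, 1330, 1459, 1530, 1630), times = 25),
  Occupancy = round(runif(200, min = 400, max = 4300))
)

# Band edges taken from the text: first quartile 11:01, third quartile 14:59.
# right = FALSE gives left-closed intervals [-Inf, 1101), [1101, 1459), [1459, Inf),
# so a record at exactly 1101 or 1459 lands in a band instead of being dropped.
demo$Band <- cut(demo$Time,
                 breaks = c(-Inf, 1101, 1459, Inf),
                 labels = c("early", "intermediate", "late"),
                 right  = FALSE)

# Mean, median and third quartile of occupancy for each band.
band_stats <- aggregate(Occupancy ~ Band, data = demo,
                        FUN = function(x) c(mean   = mean(x),
                                            median = median(x),
                                            q3     = unname(quantile(x, 0.75))))
print(band_stats)
```

Reporting the mean and the median side by side makes it harder to confuse the two when the occupancy distribution within a band is skewed.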
General Conclusions
There is no direct correlation between the occupancy level and the day of the month, but there is a correlation between the occupancy level and the time of day.
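One way to put a number on a statement of this kind is a rank correlation computed with base R's cor(). The following is a minimal sketch, not the calculation used in this study: the data frame demo is synthetic, and the assumption that occupancy rises with the hour and ignores the day is made up purely so the two coefficients differ visibly.

```r
# Synthetic illustration only; none of these values come from the study data.
set.seed(42)
n <- 500
demo <- data.frame(
  Day  = sample(1:31, n, replace = TRUE),
  Time = sample(seq(800, 1600, by = 100), n, replace = TRUE)
)

# Assumption for the demo: occupancy grows with the hour and does not depend on the day.
demo$Occupancy <- 500 + 2 * (demo$Time - 800) + rnorm(n, sd = 200)

# Spearman's rho measures monotonic association and is robust to outliers.
rho_day  <- cor(demo$Occupancy, demo$Day,  method = "spearman")
rho_time <- cor(demo$Occupancy, demo$Time, method = "spearman")
print(c(day = rho_day, time = rho_time))
```

A single coefficient only captures monotonic trends. If occupancy peaks at midday and drops afterwards, the coefficient can understate the relationship, so comparing summaries by time band remains a useful complement.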
The highest occupancy values occur in intermediate hours (between 10:00 and 14:59 approximately), while the lowest occupancy values are recorded in the early hours of the day (before 10:00).
We can also observe a slight increase in occupancy levels from the second half of November to December compared to the period from October to the first half of November.
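A comparison between the two periods can be sketched by labelling each record with its period and averaging the occupancy per label. The example below is illustrative only: the data frame demo holds a handful of made-up records, and taking November 16 as the start of the "second half of November" is an assumption, since the text does not state the exact cutoff.

```r
# Made-up records for illustration; NOT the study data.
demo <- data.frame(
  Month     = c(10, 10, 11, 11, 11, 12, 12),
  Day       = c(5, 20, 10, 15, 16, 3, 28),
  Occupancy = c(1200, 1300, 1250, 1350, 1500, 1600, 1550)
)

# Assumed cutoff: November 16 onward counts as the later period.
late <- demo$Month == 12 | (demo$Month == 11 & demo$Day >= 16)
demo$Period <- ifelse(late, "Nov 16 - Dec", "Oct - Nov 15")

# Mean occupancy per period.
period_means <- tapply(demo$Occupancy, demo$Period, mean)
print(period_means)
```

On the real data, the same grouping could be applied to each cluster separately to see whether the increase holds across all of them.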