DATA MANIPULATION AND ANALYSIS

K-MEANS CLUSTERING

After segmenting and filtering the data, I laid the foundation for the k-means clustering analysis, which would let me group drop-offs and identify hotspots in the City. I first needed a sensible number of clusters. Total intra-cluster variation generally keeps falling as clusters are added, so I used the elbow method to find the point beyond which extra clusters stop paying off. To my surprise, the elbow appeared at a very small number of clusters: beyond approximately 15, the marginal decrease in variation was negligible. Despite this, I still clustered the data into 100 clusters, since the students obviously visited far more than 15 restaurants, bars, and clubs.
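The elbow check described above can be sketched in a few lines. This is only an illustration in plain Python, not the code used for the project: the data is synthetic (three artificial blobs rather than real drop-off coordinates), the function names are hypothetical, and the deterministic farthest-first initialisation is an assumption made to keep the example reproducible.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def init_centroids(points, k):
    """Farthest-first initialisation (an assumption; keeps the run deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, total within-cluster sum of squares)."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

# Synthetic stand-in for drop-off coordinates: three well-separated blobs.
rng = random.Random(0)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in blob_centres
    for _ in range(50)
]

# Elbow method: record the intra-cluster variation for a range of k values.
wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 1))
```

On this toy data the variation drops sharply up to k = 3 and barely changes afterwards, which is the "elbow". With real drop-off data the curve is usually smoother, so choosing the elbow involves some judgement.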

FILTERING DATA TO DETERMINE HOTSPOTS

After assigning a cluster value to each entry, I ranked the cluster values by frequency, which showed me which cluster values had the most entries. I interpreted these cluster values as corresponding to NYC hotspots. The first time I ran this, I realised that some of the coordinate data was faulty, as the pick-up and drop-off coordinates were identical, so I went back and added a filter to remove these erroneous entries. After determining the key cluster values, I filtered the original data frames (16 of them) to include only entries from the 3 most popular cluster values in each dataset.
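The filtering steps above can be sketched as follows. This is a minimal illustration in plain Python rather than the data-frame code used for the project; the record layout, the cluster labels, and the function names are all hypothetical.

```python
from collections import Counter

def make_trip(i, cluster):
    """Build a hypothetical trip record with distinct pick-up and drop-off points."""
    return {
        "pickup": (40.7295, -73.9965),
        "dropoff": (40.7000 + 0.001 * i, -74.0000 + 0.001 * i),
        "cluster": cluster,
    }

# Made-up cluster labels standing in for the k-means output.
trips = [make_trip(i, c) for i, c in enumerate([12, 12, 12, 12, 7, 7, 7, 3, 3, 5])]
# One faulty record: pick-up and drop-off coordinates are identical.
trips.append({"pickup": (40.73, -73.99), "dropoff": (40.73, -73.99), "cluster": 5})

def drop_zero_length_trips(trips):
    """Remove entries whose pick-up and drop-off coordinates are identical."""
    return [t for t in trips if t["pickup"] != t["dropoff"]]

def top_clusters(trips, n=3):
    """Return the n cluster values with the most entries, most frequent first."""
    counts = Counter(t["cluster"] for t in trips)
    return [cluster for cluster, _ in counts.most_common(n)]

def keep_clusters(trips, clusters):
    """Keep only the entries that belong to one of the given clusters."""
    wanted = set(clusters)
    return [t for t in trips if t["cluster"] in wanted]

clean = drop_zero_length_trips(trips)
hotspots = top_clusters(clean, n=3)
hotspot_trips = keep_clusters(clean, hotspots)
```

In the project this would be repeated for each of the 16 data frames, so every dataset keeps its own three most popular clusters.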

INITIAL MAPPING OF ENTRIES

The initial data was too noisy and too large to analyse directly, so I mapped only the entries that remained after filtering for various criteria, such as keeping the top three clusters. I used the Leaflet package to plot the coordinates. At first, I plotted a consolidated map containing data from both universities and all years, differentiating the schools and years with a colour palette. The results were too chaotic to interpret.

NEXT ITERATION OF MAPPING

I realised that to gain any insight from the map, I needed to be able to filter it by dataset (year/school). I used the layer functionality of the Leaflet package to achieve this. This allowed me not only to colour each dataset differently, but also to hide data points for deeper analysis. As stated earlier, I wasn't able to plot 2018 points as coordinates were not provided for that year.

FINAL ITERATION OF MAPPING

The final version of the map was useful because it helped visualise clustering. The marker cluster feature in the Leaflet package made the density of markers viewable in a manageable way, which helped with the overall analysis. In the future, I would like to explore combining the layer and clustering features.
