DATA COLLECTION

Taxi and Restaurant Data

COLLECTING TAXI DATA

I sourced the NYC Yellow Taxi data from NYC Open Data. The data include fields such as pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, fare amounts, rate types, payment types, and passenger counts, and cover 2009 to 2018. The initial problem was that each dataset was 5+ GB, which made it infeasible to analyse the full data in R, so I decided to analyse a sizeable subset instead. The question was which subset. After considering multiple options, including randomisation, I concluded that I should select a specific day and time period to deconstruct.

I considered NSO, Labour Day, St. Patrick's Day, and the night before Thanksgiving, which is apparently the most popular time of the year to go out. Each of those days had issues: timing discrepancies (NSO differs for each school), cultural associations (St. Patrick's Day), the flight of students (Thanksgiving), and other problems. I concluded that the best day to analyse would be Halloween night. Because Halloween often falls on a weekday, I assumed that if it falls on a Sunday, Monday, Tuesday, or Wednesday, it is celebrated the prior Saturday. I selected the 7pm-12am time frame, as that is when most individuals leave to go out. I wanted to analyse day-time activities as well, but that will have to be a future project.
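The day and time selection described above can be sketched in R roughly as follows. This is a minimal illustration, not the original analysis code: the function names `halloween_night` and `in_window` are hypothetical, and the sketch assumes pick-up times have already been parsed as `POSIXct` values.

```r
# Date on which Halloween is assumed to be celebrated in a given year.
# Rule from the text: if 31 October falls on Sunday to Wednesday,
# the celebration moves to the preceding Saturday.
halloween_night <- function(year) {
  oct31 <- as.Date(sprintf("%d-10-31", year))
  wday <- as.POSIXlt(oct31)$wday  # 0 = Sunday, ..., 6 = Saturday
  if (wday <= 3) {
    # Days back to the previous Saturday: Sun 1, Mon 2, Tue 3, Wed 4
    oct31 - (wday + 1)
  } else {
    oct31
  }
}

# TRUE for pick-ups on the chosen night between 7pm and midnight.
# `pickup` is assumed to be a POSIXct vector (hypothetical input).
in_window <- function(pickup, year) {
  pickup_date <- as.Date(format(pickup, "%Y-%m-%d"))
  pickup_hour <- as.integer(format(pickup, "%H"))
  pickup_date == halloween_night(year) & pickup_hour >= 19
}

# Small made-up example for 2012, when 31 October was a Wednesday
pickups <- as.POSIXct(
  c("2012-10-27 18:59:00", "2012-10-27 19:00:00",
    "2012-10-27 23:59:59", "2012-10-31 21:00:00"),
  tz = "UTC"
)
keep <- in_window(pickups, 2012)
```

Under this rule, 2009 and 2015 keep 31 October (both Saturdays), while 2012 and 2018 shift to 27 October.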

LOADING THE DATA

Because there were multiple datasets, I used a loop to load the data into R and dynamically assigned a name to each dataset. Because I was analysing Columbia and NYU students, I outlined each school's campus on Google Maps and generated the coordinates of its extremes. Using these coordinates, I filtered out the entries in each dataset whose pick-up location was not within those boundaries. I effectively assumed that if a ride was initiated within a school's campus, the rider was likely to be a student. Inevitably, this assumption is not certain, because NYC attracts numerous tourists and individuals travel frequently within the city. At the end, I had 2 data frames (NYU and Columbia) for each available year from 2009 to 2018, for a total of 18 datasets (the 2010 data wasn't available online). For simplicity's sake, I decided to consider only the 2009, 2012, 2015, and 2018 datasets (increments of 3 years), as they would be sufficient to distil trends over time. The 2018 data came with zones as opposed to coordinates, so I used an online database to find the neighbourhood for each entry.
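A rough sketch of the loop and bounding-box filter described above is shown here. Everything specific in it is an assumption made for illustration: the column names `pickup_longitude` and `pickup_latitude`, the helper `filter_by_box`, and the box coordinates, which are placeholders rather than the boundaries actually traced on Google Maps. Tiny in-memory data frames stand in for the real files, and a named list is used in place of dynamically assigned variable names.

```r
# Placeholder bounding boxes (NOT the boundaries used in the analysis)
boxes <- list(
  columbia = list(lon_min = -73.970, lon_max = -73.955,
                  lat_min = 40.800, lat_max = 40.815),
  nyu      = list(lon_min = -74.000, lon_max = -73.990,
                  lat_min = 40.725, lat_max = 40.735)
)

# Keep only the trips whose pick-up point lies inside the box.
# Column names are assumed; rows with missing coordinates are dropped.
filter_by_box <- function(trips, box) {
  inside <- trips$pickup_longitude >= box$lon_min &
    trips$pickup_longitude <= box$lon_max &
    trips$pickup_latitude >= box$lat_min &
    trips$pickup_latitude <= box$lat_max
  trips[inside & !is.na(inside), , drop = FALSE]
}

# Toy stand-ins for the yearly datasets; with real data each element
# would come from reading that year's file inside the loop instead.
trips_by_year <- list(
  "2009" = data.frame(
    pickup_longitude = c(-73.9626, -73.9965, -73.9855),
    pickup_latitude  = c(40.8075, 40.7295, 40.7580)
  ),
  "2012" = data.frame(
    pickup_longitude = c(-73.9630, -73.9000),
    pickup_latitude  = c(40.8080, 40.7000)
  )
)

# One data frame per school per year, named e.g. "columbia_2009"
subsets <- list()
for (year in names(trips_by_year)) {
  for (school in names(boxes)) {
    subsets[[paste(school, year, sep = "_")]] <-
      filter_by_box(trips_by_year[[year]], boxes[[school]])
  }
}
```

Storing the results in a named list keeps each subset addressable by name, such as `subsets[["nyu_2012"]]`, without creating separate variables in the workspace.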