
DATA COLLECTION

Taxi and Restaurant Data

COLLECTING TAXI DATA

I sourced the NYC Yellow Taxi data from NYC Open Data. The data include fields such as pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, fare amounts, rate types, payment types, and passenger counts. Data were available for 2009 to 2018. The first problem I faced was that each dataset was 5+ GB, which made it infeasible to analyse the full data in R, so I decided to analyse a sizeable subset instead. The question was which subset. After considering multiple options, including random sampling, I concluded that I should select a specific day and time period to deconstruct. I considered NSO, Labour Day, St. Patrick's Day, and the night before Thanksgiving, which is apparently the most popular time of the year to go out. Each of those days had issues: timing discrepancies (NSO differs for each school), cultural associations (St. Patrick's Day), the flight of students (Thanksgiving), and other problems. I concluded that the best day to analyse would be Halloween night. Because Halloween can fall on a weekday, I assumed that if it falls on a Sunday, Monday, Tuesday, or Wednesday, it is celebrated on the prior Saturday. I selected the time frame 7pm-12am, as that is when most individuals leave to go out. I also wanted to analyse day-time activities, but that will have to be a future project.
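The date rule and time window described above can be sketched in a few lines. My analysis was done in R; the Python below is only an illustrative sketch, and the function names `observed_halloween` and `in_evening_window` are hypothetical, not taken from my scripts. It assumes the window runs from 7pm up to, but not including, midnight.

```python
from datetime import date, datetime, time, timedelta


def observed_halloween(year: int) -> date:
    """Return the date on which Halloween is assumed to be celebrated."""
    halloween = date(year, 10, 31)
    weekday = halloween.weekday()  # Monday = 0 ... Sunday = 6
    # Assumption from the text: Sunday, Monday, Tuesday, or Wednesday
    # shifts the celebration to the prior Saturday.
    if weekday in (6, 0, 1, 2):
        days_back = (weekday - 5) % 7  # days since the most recent Saturday
        return halloween - timedelta(days=days_back)
    return halloween


def in_evening_window(pickup: datetime, year: int) -> bool:
    """True if the pick-up falls between 7pm and midnight on the observed night."""
    night = observed_halloween(year)
    start = datetime.combine(night, time(19, 0))
    end = datetime.combine(night + timedelta(days=1), time(0, 0))
    return start <= pickup < end
```

Under this rule, Halloween 2012 and 2018 (both Wednesdays) shift to Saturday 27 October, while 2009 and 2015 (both Saturdays) stay on 31 October.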

LOADING THE DATA

Because there were multiple datasets, I used a loop to load the data into R and dynamically assigned a name to each dataset. Because I was analysing Columbia and NYU students, I outlined the campus of each school on Google Maps and generated the coordinates of its extremes. Using these coordinates, I filtered out entries in each dataset where the pick-up location was not within those boundaries. I effectively assumed that if a ride was initiated within a school's campus, the rider is likely to be a student. Admittedly, this assumption is not certain, because NYC attracts numerous tourists and individuals frequently travel within the city. At the end, I had 2 data frames (NYU and Columbia) for each year (2009-2018), resulting in a total of 18 datasets (2010 data wasn't available online). For simplicity's sake, I decided to consider only the 2009, 2012, 2015, and 2018 datasets (increments of 3), as they would be sufficient to distill trends over time. The 2018 data came with zones as opposed to coordinates, so I used an online database to find the neighbourhood related to each entry.
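The campus filter amounts to a bounding-box test on each pick-up location. The sketch below is in Python rather than the R I actually used, and everything in it is an assumption made for illustration: the `BoundingBox` class, the field names `pickup_latitude` and `pickup_longitude`, and especially the coordinates, which are rough placeholders and not the boundaries I traced on Google Maps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle defined by the extreme coordinates of a campus outline."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Placeholder boxes for illustration only (NOT the values used in the analysis).
CAMPUS_BOXES = {
    "columbia": BoundingBox(40.805, 40.811, -73.966, -73.959),
    "nyu": BoundingBox(40.727, 40.733, -74.001, -73.992),
}


def filter_by_campus(trips, box):
    """Keep only trips whose pick-up point lies inside the box."""
    return [
        trip for trip in trips
        if box.contains(trip["pickup_latitude"], trip["pickup_longitude"])
    ]


def split_by_campus(trips):
    """Return one filtered list per campus, keyed by campus name."""
    return {name: filter_by_campus(trips, box)
            for name, box in CAMPUS_BOXES.items()}
```

Keying the results by campus name (and, one level up, by year) plays the same role as the dynamically assigned dataset names in the R loop. A rectangle is a coarse approximation of a campus outline, so it will also capture pick-ups on adjacent streets.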


