Data Acquisition/Cleaning: Restaurant Recommendation NYC

The Problem

I recently moved to NYC and a friend asked me to choose a spot for us to meet.  He was in the Upper East Side and I was deep in Brooklyn.  There were easily ten thousand different spots for us to meet between our locations, so I jumped into the world of apps dedicated to making this easier.

Each of these apps has issues.  Mainly, they don’t narrow the list down enough, and they require you to make a lot of “blind” decisions to get to any kind of list.  Blind decisions mean filtering by cuisine (many restaurants are mislabeled), by cost or rating, and by location.  Let’s see how easy it is to make a decision by filtering on location alone.  (Dataset provided by the City of New York.)

As you can see, there is an overwhelming number of restaurants and cuisines to choose from.  While you may be able to narrow by cuisine and then by location, there is still the obvious fact that two Italian restaurants may not be of similar quality at all.  This brings me back to Yelp.  They have filtering options, so I figured the best place to start this journey was by acquiring all of the information they have on restaurants in NYC.

Data Acquisition:

The Yelp API had some major limitations, the main one being the limited number of restaurants I could pull each day, so I decided to build my own scraper using BeautifulSoup.  I created a bot that searched for the terms "restaurant" and "bar" in every zip code in Manhattan and Brooklyn (80+) and pulled the URL for each restaurant on the first 100 result pages.
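
For illustration, here is a minimal sketch of the two pieces such a bot needs: building a search URL for a term, zip code, and result page, and pulling restaurant links out of the returned HTML with BeautifulSoup.  It is not the original scraper.  The host, the query parameter names, the page size, and the "/biz/" link prefix are all placeholder assumptions, and fetching the pages is left out.

```python
from urllib.parse import urlencode, urljoin

from bs4 import BeautifulSoup

# Placeholder host: an assumption for this sketch, not the real site.
BASE_URL = "https://www.example.com"


def search_url(term, zip_code, page, per_page=10):
    """Build a search URL for one term, zip code, and result page.

    The parameter names (find_desc, find_loc, start) and the page size
    are assumptions chosen for illustration.
    """
    query = urlencode(
        {"find_desc": term, "find_loc": zip_code, "start": page * per_page}
    )
    return f"{BASE_URL}/search?{query}"


def extract_business_links(html):
    """Return the unique restaurant URLs found in one result page.

    Assumes restaurant links start with "/biz/" (hypothetical pattern).
    Query strings are dropped so the same restaurant is not counted twice.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith("/biz/"):
            links.add(urljoin(BASE_URL, href.split("?")[0]))
    return sorted(links)
```

Looping `search_url` over both terms, every zip code, and pages 0 through 99 would give the full list of result pages to fetch.  Collecting the links in a set matters because the same restaurant tends to show up under several neighboring zip codes.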

Sample Restaurant

The next step was to scrape each restaurant page for all the useful information (I was especially interested in the reviews, thinking they would tell a better story than just the listed attributes of the restaurant).  The above two images show a good amount of the info I wanted, so I built another scraper bot and pulled everything I could from every restaurant page.
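
As a rough sketch of what parsing a single restaurant page can look like with BeautifulSoup, the function below pulls a name, an address, and the review text out of one HTML document.  The tag and class names (`h1`, `address`, `p.review-text`) are assumptions made for this example; a real page needs its own selectors, and many more fields than these three.

```python
from bs4 import BeautifulSoup


def parse_restaurant_page(html):
    """Extract a few fields from one restaurant page.

    The markup this looks for is hypothetical: an <h1> with the name,
    an <address> block, and <p class="review-text"> for each review.
    Missing fields come back as None (or an empty list for reviews).
    """
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1")
    address = soup.find("address")
    reviews = [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p", class_="review-text")
    ]
    return {
        "name": name.get_text(strip=True) if name else None,
        "address": address.get_text(" ", strip=True) if address else None,
        "reviews": reviews,
    }
```

Returning None for missing fields instead of raising keeps one oddly formatted page from stopping a long scraping run; the gaps can be dealt with later during cleaning.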

Data Cleaning:

Needless to say, the data was messy.  Very messy.  I had 12 columns of data that initially parsed out to over 30 columns.  It took a long time to parse out all of the info on this page.  At first it was mostly regex and simple string cleaning.  As things got more polished, I filtered the addresses to make sure only Manhattan and Brooklyn locations were included, then used Google's Geocoder to get the latitude and longitude for each location.  Below is a look at the restaurants and some of the info I was able to pull.

Note that you can interact with any aspect of this dashboard to filter by whatever you like.
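
To make the cleaning steps described above more concrete, here is a small sketch of the regex and string cleaning involved, plus one possible way to keep only Manhattan and Brooklyn addresses.  The zip-prefix rule is an assumption for this example, not necessarily how the addresses were actually filtered: it treats zip codes starting with 100, 101, or 102 as Manhattan and 112 as Brooklyn, which is a rough heuristic.  The geocoding step is left out because it requires an API key.

```python
import re

# A 5-digit zip code, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")


def normalize_whitespace(text):
    """Collapse runs of spaces and newlines left over from scraping."""
    return re.sub(r"\s+", " ", text).strip()


def borough_from_address(address):
    """Guess the borough from the last zip code in an address.

    Heuristic assumption: prefixes 100/101/102 mean Manhattan and 112
    means Brooklyn. Anything else, or no zip code at all, returns None.
    """
    matches = ZIP_RE.findall(address)
    if not matches:
        return None
    zip_code = matches[-1]  # the zip code normally comes last
    if zip_code[:3] in {"100", "101", "102"}:
        return "Manhattan"
    if zip_code.startswith("112"):
        return "Brooklyn"
    return None
```

For example, `borough_from_address("1 Front St, Brooklyn, NY 11201")` returns `"Brooklyn"`, while a Queens address ending in 11101 returns `None` and would be dropped.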

Next Steps:

The next post will include some basic exploratory data analysis and get into the natural language processing I did on the reviews in order to build a good recommendation model.