Data Acquisition/Cleaning: Restaurant Recommendation NYC

The Problem

I recently moved to NYC and a friend asked me to choose a spot for us to meet. He was on the Upper East Side and I was deep in Brooklyn. There were easily ten thousand different spots for us to meet between our locations, so I jumped into the world of apps dedicated to making this easier.

There are issues with each of these approaches. Mainly, they don’t narrow down the list enough, and they require you to make a lot of “blind” decisions just to get to some kind of list. By blind decisions I mean filtering by cuisine (many restaurants are mislabeled), by cost or rating, and by location. Let’s see how easy it is to make a decision by filtering on location alone. (Dataset provided by the City of New York.)

As you can see, there is an overwhelming number of restaurants and cuisines to choose from. You may be able to narrow by cuisine and then by location, but two Italian restaurants in the same neighborhood can still differ widely in quality. This brings me back to Yelp. They have filtering options, so I figured the best place to start this journey was by acquiring all of the information they have on restaurants in NYC.

Data Acquisition:

The Yelp API had some major limitations, the main one being the limited number of restaurants I could pull each day, so I decided to build my own scraper using BeautifulSoup. I created a bot that searched for the terms "restaurant" and "bar" in every zip code in Manhattan and Brooklyn (80+) and pulled the URL for each restaurant on the first 100 result pages.
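Here is a minimal sketch of that crawl, not the bot itself. It builds the grid of search-page URLs (term x zip code x page) and pulls restaurant links out of one results page. The base URL, the query parameter names, the page size, and the "/biz/" link prefix are all assumptions for illustration. The link extraction also uses Python's built-in html.parser instead of BeautifulSoup so the example runs without extra installs.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode

# Hypothetical values: the real site, parameter names and page size may differ.
BASE_URL = "https://example.com/search"
RESULTS_PER_PAGE = 10


def build_search_urls(terms, zip_codes, max_pages):
    """Return one search URL per (term, zip code, page) combination."""
    urls = []
    for term, zip_code, page in product(terms, zip_codes, range(max_pages)):
        query = urlencode({
            "find_desc": term,
            "find_loc": zip_code,
            "start": page * RESULTS_PER_PAGE,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


class BusinessLinkParser(HTMLParser):
    """Collect unique hrefs that look like business pages (assumed '/biz/' prefix)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Drop the query string so the same restaurant is not stored twice.
        path = href.split("?", 1)[0]
        if path.startswith("/biz/") and path not in self.links:
            self.links.append(path)


def extract_business_links(html):
    """Return the business-page paths found in one page of search results."""
    parser = BusinessLinkParser()
    parser.feed(html)
    return parser.links
```

With two terms, 80 zip codes, and 100 pages each, build_search_urls would return 16,000 URLs, so a real crawl would also want polite delays between requests.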

Sample Restaurant

The next step was to scrape each restaurant page for all the useful information. (I was especially interested in the reviews, thinking they would tell a better story than just the listed attributes of the restaurant.) The above two images show a good amount of the info I wanted, so I built another scraper bot and pulled everything I could from every restaurant page.

Data Cleaning:

Needless to say, the data was messy. Very messy. I had 12 columns of data that initially parsed out to over 30 columns, and it took a long time to parse out all of the info on each page. At first it was mostly regex and simple string cleaning. As things got more polished, I filtered the addresses to make sure only Manhattan and Brooklyn locations were included, then used Google's Geocoder to get the latitude and longitude for each location. Below is a look at the restaurants and some of the info I was able to pull.
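To give a feel for the regex and string cleaning, here is a small sketch of the address filter. It assumes each scraped address ends in a ZIP code, and it uses ZIP prefixes as a stand-in for a proper borough lookup (a simplifying assumption, as are the function names). Geocoding is left out because it needs a network call and an API key.

```python
import re

# Simplifying assumption: Manhattan ZIP codes start with 100-102 and
# Brooklyn ZIP codes start with 112. A full ZIP list would be more precise.
BOROUGH_BY_PREFIX = {"100": "Manhattan", "101": "Manhattan",
                     "102": "Manhattan", "112": "Brooklyn"}

# A 5-digit ZIP (optionally ZIP+4) at the end of the address string.
ZIP_AT_END = re.compile(r"(\d{5})(?:-\d{4})?\s*$")


def clean_address(raw):
    """Collapse runs of whitespace/newlines left over from the scraped HTML."""
    return re.sub(r"\s+", " ", raw).strip()


def borough_for(address):
    """Return 'Manhattan', 'Brooklyn', or None if the ZIP is missing or elsewhere."""
    match = ZIP_AT_END.search(address)
    if match is None:
        return None
    return BOROUGH_BY_PREFIX.get(match.group(1)[:3])


def keep_target_boroughs(raw_addresses):
    """Clean each address and keep only those in Manhattan or Brooklyn."""
    kept = []
    for raw in raw_addresses:
        address = clean_address(raw)
        borough = borough_for(address)
        if borough is not None:
            kept.append((address, borough))
    return kept
```

Anchoring the pattern at the end of the string keeps a five-digit street number from being mistaken for the ZIP code.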

Note that you can interact with any part of this dashboard to filter however you like.

Next Steps:

The next post will include some basic exploratory data analysis and get into the natural language processing I did on the reviews in order to build a good recommendation model.



A brief description of how I built a predictive model using Natural Language Processing on IMDB ratings to suggest which movies might be up- or down-voted on Netflix.


Netflix "hired" me to figure out which features matter most in whether a movie gets up- or down-voted. Netflix, however, did not want to give me access to their database of movies, so I had to acquire the data from somewhere else. IMDB seemed to be the obvious choice. Due to some limitations in how they lay out their website, I decided to go with a list site that supposedly had every wide-release movie from 1976-2016. This list site can be found here:

My approach was to "scrape" every unique ID from that site and then use those IDs to scrape each individual IMDB page for the information I needed. I ended up with about 10,000 movies. My goal was to create a model that could predict an up or down vote using only features that were available before release. For example, I wanted to use actors, directors, etc. to predict rating, and to drop features like awards won, box office take, or IMDB votes. I consider this a purer approach to the problem because it allows Netflix to start negotiations before a movie is released rather than waiting for the film's results.
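A rough sketch of those two ideas, collecting unique IDs and keeping only pre-release features, might look like the code below. It assumes the list site links to IMDB title pages, whose IDs are "tt" followed by at least seven digits. The field names in PRE_RELEASE_FIELDS are hypothetical and stand in for whatever columns the scrape actually produced.

```python
import re

# IMDb title IDs look like "tt" followed by at least seven digits.
IMDB_ID = re.compile(r"tt\d{7,}")

# Hypothetical field names for features that are known before release.
PRE_RELEASE_FIELDS = {"actors", "directors", "writers", "genre", "runtime", "year"}


def unique_imdb_ids(html):
    """Return each IMDb ID found in the page once, in order of first appearance."""
    seen = {}
    for imdb_id in IMDB_ID.findall(html):
        seen.setdefault(imdb_id, None)
    return list(seen)


def title_url(imdb_id):
    """Build the URL of the individual movie page to scrape next."""
    return f"https://www.imdb.com/title/{imdb_id}/"


def pre_release_only(movie):
    """Drop post-release fields such as awards, box office, or vote counts."""
    return {key: value for key, value in movie.items() if key in PRE_RELEASE_FIELDS}
```

Deduplicating the IDs first matters because a list page often links to the same movie several times (poster, title, reviews).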

Data Cleaning

I am aware that I do not have the most extensive list of movies, nor do I have information on how they ended up rating on Netflix, but I figured that with roughly 10k movies I could make up for the fact that this is not a random sample.

The data cleaning process was simple. I got rid of any movies without ratings, any rows that weren't movies, and basically any rows with missing information. I also removed all movies with strange ratings, movies released before 1970, and movies with a runtime of less than 45 minutes or more than 210 minutes. After that I changed the IMDB Rating from a numerical range to either 1 for an up vote or 0 for a down vote to make for an easier model.
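Those rules can be sketched in a few lines of plain Python. The field names and the "N/A" placeholder are assumptions, and so is the 7.0 cutoff, since I have not stated above which rating counts as an up vote. The "strange ratings" filter is left out because it depends on what the scraped values looked like.

```python
# Hypothetical cutoff: the rating that separates up votes from down votes.
UP_VOTE_THRESHOLD = 7.0


def clean_movies(rows, threshold=UP_VOTE_THRESHOLD):
    """Filter raw rows and add a binary 'label' (1 = up vote, 0 = down vote)."""
    cleaned = []
    for row in rows:
        # Drop rows that are not movies or that have any missing value.
        if row.get("type") != "movie":
            continue
        if any(value in (None, "", "N/A") for value in row.values()):
            continue
        # Drop old releases and implausible runtimes.
        if row["year"] < 1970 or not 45 <= row["runtime"] <= 210:
            continue
        label = 1 if row["rating"] >= threshold else 0
        cleaned.append({**row, "label": label})
    return cleaned
```

Turning the rating into a 0/1 label is what makes this a classification problem, which is why a logistic regression fits later on.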

Data Modeling

This is the fun stuff. In order to figure out who the important actors, writers, and directors were, I needed to build a model. I used some natural language processing to pull out all the names of, for example, the actors. I then used all of those names as features in a LogisticRegression to find out who was most important. By important I mean having a coefficient above a set threshold (which I tweaked until I got the best results). The impact could be positive or negative. These are the results of that modeling.
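To make the idea concrete, here is a toy version written in plain Python so it runs without any installs. It is a sketch of the technique (one token per name, a logistic regression over those tokens, then a threshold on the coefficient size), not the model from this post. The learning rate, the L2 penalty, and the number of epochs are arbitrary assumptions.

```python
import math


def tokenize_names(cast):
    """Turn 'Actor One, Actor Two' into ['actor_one', 'actor_two'], one token per name."""
    return [name.strip().lower().replace(" ", "_")
            for name in cast.split(",") if name.strip()]


def fit_logistic(casts, labels, epochs=500, lr=0.5, l2=0.01):
    """Fit a bag-of-names logistic regression with batch gradient descent."""
    docs = [set(tokenize_names(cast)) for cast in casts]
    vocab = sorted(set().union(*docs))
    weights = {name: 0.0 for name in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad = {name: 0.0 for name in vocab}
        grad_bias = 0.0
        for doc, y in zip(docs, labels):
            z = bias + sum(weights[name] for name in doc)
            # Prediction error: sigmoid(z) minus the true 0/1 label.
            error = 1.0 / (1.0 + math.exp(-z)) - y
            grad_bias += error
            for name in doc:
                grad[name] += error
        bias -= lr * grad_bias / n
        for name in vocab:
            weights[name] -= lr * (grad[name] / n + l2 * weights[name])
    return weights, bias


def influential_names(weights, threshold):
    """Keep names whose coefficient magnitude exceeds the threshold, strongest first."""
    kept = {name: w for name, w in weights.items() if abs(w) > threshold}
    return sorted(kept.items(), key=lambda item: -abs(item[1]))
```

A name that shows up equally often in up-voted and down-voted movies ends up with a coefficient near zero and falls below the threshold, which is what keeps the final list short.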


The actors, writers, and directors in black have a positive influence on movies; the ones in red have a negative impact. This is a nice short list to hand over to Netflix to tell them which people to look for in movies. After doing a lot more modeling on other features and their correlation to the IMDB Rating, I found that runtime, genre, and year were also important things to watch. With all of that information fed into my model, I am able to predict with about 75% accuracy whether a movie will be rated high or low.

In the end I was able to pull out a lot of features that were important, but there is more work to be done. As always, more movies would be helpful, splitting the data into decades would probably yield better accuracy, and getting individual ratings for the actors, writers, and directors involved would help too. But for now I can confidently tell Netflix that a longer movie written and directed by Woody Allen with Leonardo DiCaprio as the lead will do well on Netflix (as long as Jean-Claude Van Damme doesn't make an appearance).