A brief description of how I built a predictive model using Natural Language Processing on IMDB ratings to suggest which movies might be up or down voted on Netflix.


Netflix "hired" me to figure out which features matter when predicting whether a movie will be up or down voted. Netflix, however, did not want to give me access to their database of movies, so I had to acquire the data from somewhere else. IMDB seemed to be the obvious choice. Due to some limitations in how they lay out their website, I decided to go with a list site that supposedly had every wide-release movie from 1976-2016. This list site can be found here:

My approach was to "scrape" every unique ID from that site and then use those IDs to scrape each individual IMDB page for the information I needed. I ended up with about 10,000 movies. My goal for the data was to create a model that could predict an up or down vote using only features that were available before release. For example, I wanted to use actors, directors, etc. to predict rating, and to drop features like awards won, box office take, or IMDB votes. I consider this a purer approach to the problem because it allows Netflix to start negotiations before a movie is released rather than waiting for the film's results.
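As a rough illustration of the first step, here is a minimal sketch of pulling unique IMDB title IDs out of a page's HTML. It assumes the list page links to titles with the usual /title/tt<digits>/ URL pattern; the sample markup and the helper names (extract_title_ids, title_url) are made up for this example and are not the code I actually ran.

```python
import re

# IMDB title pages live under /title/tt<digits>/ (assumed link format).
TITLE_ID = re.compile(r"/title/(tt\d{7,})")


def extract_title_ids(html):
    """Return the unique title IDs found in html, in first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(TITLE_ID.findall(html)))


def title_url(title_id):
    """Build the URL of the individual page to scrape next."""
    return f"https://www.imdb.com/title/{title_id}/"


# Made-up markup standing in for the downloaded list page.
sample_html = """
<a href="https://www.imdb.com/title/tt0075314/">Movie A</a>
<a href="/title/tt0075314/reviews">Movie A reviews</a>
<a href="/title/tt1375666/">Movie B</a>
"""

ids = extract_title_ids(sample_html)
urls = [title_url(i) for i in ids]
```

Deduplicating before the second pass matters because a list page usually links to the same movie more than once, and each duplicate would otherwise cost an extra page request.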

Data Cleaning

I am aware that I do not have the most extensive list of movies, nor do I have information on how they were ultimately rated on Netflix, but I figured that roughly 10k movies would be enough to make up for the sample not being random.

The data cleaning process was simple. I got rid of any movies without ratings, any rows that weren't movies, and basically any rows with missing information. I also removed all movies with strange ratings, movies released before 1970, and movies with a runtime of less than 45 minutes or more than 210 minutes. After that, I changed the IMDB Rating from a numerical range to either 1 for an up vote or 0 for a down vote to make for an easier model.
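The filters above can be sketched in a few lines of plain Python. The field names (type, rating, year, runtime) and the 7.0 cutoff for an up vote are assumptions made for this example; the cutoff in particular is a placeholder, not the value I used. The "strange ratings" filter is left out because it depends on how the scraped data looks.

```python
REQUIRED = ("type", "rating", "year", "runtime")
UP_VOTE_CUTOFF = 7.0  # placeholder cutoff, assumed for illustration only


def clean(rows):
    """Apply the cleaning filters and add a binary up_vote label."""
    cleaned = []
    for row in rows:
        # Drop rows with any missing information.
        if any(row.get(key) is None for key in REQUIRED):
            continue
        # Keep movies only (no series, episodes, games, ...).
        if row["type"] != "movie":
            continue
        # Drop movies released before 1970.
        if row["year"] < 1970:
            continue
        # Drop runtimes shorter than 45 or longer than 210 minutes.
        if not 45 <= row["runtime"] <= 210:
            continue
        # 1 = up vote, 0 = down vote.
        label = int(row["rating"] >= UP_VOTE_CUTOFF)
        cleaned.append({**row, "up_vote": label})
    return cleaned


# Made-up rows covering each filter.
rows = [
    {"type": "movie", "rating": 8.1, "year": 1994, "runtime": 142},
    {"type": "movie", "rating": 5.2, "year": 2003, "runtime": 95},
    {"type": "series", "rating": 9.0, "year": 2010, "runtime": 50},
    {"type": "movie", "rating": None, "year": 1999, "runtime": 100},
    {"type": "movie", "rating": 7.5, "year": 1965, "runtime": 120},
    {"type": "movie", "rating": 6.0, "year": 1980, "runtime": 30},
]

cleaned = clean(rows)
labels = [row["up_vote"] for row in cleaned]
```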

Data Modeling

This is the fun stuff. In order to figure out who the important actors, writers, and directors were, I needed to build a model. The way I did this was to use some natural language processing to pull out all the names of, for example, the actors. I then used all of those names as features in a LogisticRegression to pull out who was most important. By important I mean having an impact above a set threshold (which I tweaked until I got the best results). This could mean a positive or a negative impact. These are the results of that modeling.


The actors, writers, and directors in black have a positive influence on movies; the ones in red have a negative impact. This is a nice short list to hand over to Netflix to tell them which people to look for in movies. After doing a lot more modeling than this on different features and their correlation to the IMDB Rating, I found that runtime, genre, and year were also important things to watch for. With all of that information fed into my model, I am able to predict with about 75% accuracy whether a movie will be rated high or low.

In the end I was able to pull out a lot of features that were important, but there is more work to be done. As always, more movies would be helpful, splitting the data into decades would probably yield better accuracy, and getting individual ratings for the actors/writers/directors involved would be helpful. But for now I can confidently tell Netflix that a longer movie written and directed by Woody Allen with Leonardo DiCaprio as the lead will do well on Netflix (as long as Jean-Claude Van Damme doesn't make an appearance).