Saturday, 30 July 2016


Hello, welcome to my blog. I know it’s been a long time since my last post, and I apologize for that. I have been quite busy for the past few weeks with some projects and have not had any time to write. One of the things that has kept me busy is the set of courses I have been taking on Coursera, particularly the Statistics with R specialization. In this post, I will present the project I did for one of the courses in this specialization.

The project involved answering a research question using a dataset containing information from Rotten Tomatoes and IMDB for a random sample of movies released before 2016. The goal of this project was to identify attributes of movies that make them popular using a multiple linear regression model. To view the project click here.

For my research question, I developed two multiple regression models: one for predicting a movie’s audience score and the other for predicting its critics score.

The results of the project indicate that the ‘Drama’ genre is significantly associated with higher audience scores (since its coefficient is positive). Concretely, if all other variables in the model are held constant, the audience score for movies that belong to the Drama genre is expected on average to be higher by about 10.3 points.
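To make the "all other variables held constant" interpretation concrete, here is a minimal sketch in Python. It is not the R code from the project: the data are synthetic and noise-free, the runtime variable and the numbers 40 and 0.2 are made up for illustration, and the genre effect of 10.3 is built in by construction. The point is only that the fitted coefficient on a 0/1 genre flag equals the difference in predicted score between two otherwise identical movies.

```python
# Synthetic, noise-free illustration; none of these rows come from the project's dataset.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical movies: (runtime in minutes, 1 if Drama else 0)
movies = [(90, 0), (100, 1), (110, 0), (120, 1), (130, 0), (95, 1)]
X = [[1.0, float(rt), float(d)] for rt, d in movies]  # intercept, runtime, is_drama
y = [40.0 + 0.2 * rt + 10.3 * d for rt, d in movies]  # assumed "true" relationship

intercept, b_runtime, b_drama = ols(X, y)

# Two movies with the same runtime that differ only in the genre flag
pred_non_drama = intercept + b_runtime * 100
pred_drama = intercept + b_runtime * 100 + b_drama
print("coefficient on is_drama:", round(b_drama, 3))
print("difference in predicted score:", round(pred_drama - pred_non_drama, 3))
```

In a real dataset the coefficient is an estimate with uncertainty attached, so the 10.3 points above should be read as an average difference, not a guarantee for any single movie.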

We also see that the variable best_dir_win (i.e. whether or not the director of the movie has won an Oscar) is significantly associated with higher critics scores. Concretely, if all other variables in the model are held constant, the critics score for movies whose directors have won an Oscar is expected on average to be higher by about 13.4 points. Note that even though this variable is a significant predictor of critics scores, it is not a significant predictor of audience scores. This should not come as a surprise: I think audiences may not really care whether the director of a movie has won an Oscar, whereas it may be important to critics.

Next, I predicted the critics and audience scores for the movie X-Men Apocalypse. I obtained a critics score of 60% with a 95% confidence interval of (46.1, 73.2) and an audience score of 69% with a 95% confidence interval of (59.4, 79.0). In order to verify how good my predictions were, I checked the Rotten Tomatoes website for the actual scores for the movie.

The audience score for the movie was 71%, which is pretty close to the model’s prediction of 69%. The critics score was 48%, which is not very close to our prediction of 60% but is still within the 95% confidence interval.

Note that even though the point predictions were not exactly right, both 95% confidence intervals contained the actual critics and audience scores for the movie, which is very good.
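The comparison above can be written out as a short check. This is just a sketch in Python using the figures quoted in this post; the dictionary layout and the helper name check are my own choices for illustration.

```python
# Figures quoted in this post for X-Men Apocalypse (scores in percent).
results = {
    "critics": {"predicted": 60.0, "interval": (46.1, 73.2), "actual": 48.0},
    "audience": {"predicted": 69.0, "interval": (59.4, 79.0), "actual": 71.0},
}

def check(entry):
    """Return (actual minus predicted, whether the actual score lies in the interval)."""
    low, high = entry["interval"]
    error = entry["actual"] - entry["predicted"]
    covered = low <= entry["actual"] <= high
    return error, covered

summary = {name: check(entry) for name, entry in results.items()}
for name, (error, covered) in summary.items():
    print(f"{name}: error = {error:+.1f} points, inside interval = {covered}")
```

The critics prediction is off by 12 points and the audience prediction by 2 points, yet both actual scores fall inside their intervals, with the critics score only just above the lower bound of 46.1.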

In this blog post, I presented a project that aims to identify the variables associated with higher or lower critics and audience scores. I interpreted the results of the project and also used the models to predict the critics and audience scores for a movie.

I have some news I would like to announce. I am currently in a sort of ‘partnership’ with Jumia Travel, so I will be publishing some articles related to travel, tourism and hospitality in Nigeria in the coming weeks. The first of these posts should hopefully be out this coming week. I hope you enjoy them in addition to my posts about data science and machine learning.

Thank you once again for reading my blog – in spite of my absence. I will try to make my posts more frequent in the future. Please subscribe to my blog in case you have not done so. Until next time. Cheers!!!