The aim of this project was to predict a movie's business (its box-office earnings) from the words used in it. To get the words used in a movie, I used its subtitles. Subtitles for over 1900 movies were collected from Opensubtitles.org, covering various Hindi, Telugu and Tamil movies. As the subtitles are in English, the words are English irrespective of whether the movie is Hindi, Tamil or Telugu. I used R for this project.
The first and foremost task was to clean the data. As always, data collected from the web contains a lot of noise. I removed all numbers (time slots), stop words (is, of, and, the) and punctuation from the text. I then calculated the frequency of each word in a movie's subtitles and repeated this for every movie. The most frequent words in each movie were listed and later aggregated across all the movies. This gave the words used most frequently across all the movies, along with the number of movies in which each word occurred. To train the model, I used these properties along with the budget required to make the movie.
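The cleaning and counting steps above can be sketched in base R roughly as follows. This is a minimal illustration, not the project's actual code: the subtitle lines, the short stop-word list and the names `clean_words`, `total_count` and `movie_count` are all made up for the example.

```r
# Hypothetical subtitle lines for two movies (made up for illustration).
movies <- list(
  movie_a = c("1", "00:00:01,000 --> 00:00:03,500", "Where is the money?",
              "2", "00:00:04,000 --> 00:00:06,000", "The money is in the bank."),
  movie_b = c("1", "00:00:02,000 --> 00:00:04,000", "Take the money and run.")
)

# A tiny stop-word list; a real run would need a much fuller one.
stop_words <- c("is", "of", "and", "the", "in", "a", "to")

# Lower-case the text, strip numbers and punctuation, drop stop words.
clean_words <- function(lines, stop_words) {
  text <- tolower(paste(lines, collapse = " "))
  text <- gsub("[0-9]+", " ", text)        # cue numbers and time slots
  text <- gsub("[[:punct:]]+", " ", text)  # punctuation, including "-->"
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stop_words]
}

per_movie <- lapply(movies, clean_words, stop_words = stop_words)

# Word frequencies within a single movie, most frequent first.
freq_a <- sort(table(per_movie$movie_a), decreasing = TRUE)

# Aggregate across movies: total occurrences, and number of movies per word.
total_count <- table(unlist(per_movie, use.names = FALSE))
movie_count <- table(unlist(lapply(per_movie, unique), use.names = FALSE))

features <- data.frame(
  word        = names(total_count),
  total_count = as.integer(total_count),
  movie_count = as.integer(movie_count[names(total_count)])
)
print(features)
```

Note that stripping punctuation this way splits contractions such as "don't" into "don" and "t", so a real pipeline would probably want to handle apostrophes separately.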
To get the budget and the business of each movie, I used the WikipediR package in R, which helps to scrape data from Wikipedia pages. I had other options such as Bollywoodhungama, Koimoi and many similar websites for the business figures, but their data was incomplete and not easy to scrape. Most movies have a Wikipedia page, and Wikipedia keeps most of the data in one place: lead actor, actress, producer, director, movie budget, movie business and many other things. So, using the WikipediR package, I scraped the budget and the business of each movie.
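As a rough illustration of pulling the budget and business out of a page, the sketch below parses two fields from a film infobox in base R. It assumes the page's wikitext has already been fetched into a string (for instance with WikipediR's `page_content()`); the infobox excerpt, its values and the helper `infobox_field` are hypothetical and not taken from the project.

```r
# Hypothetical infobox excerpt; the values are made up for illustration.
wikitext <- "{{Infobox film
| name     = Example Movie
| director = Some Director
| budget   = 50 crore
| gross    = 120 crore
}}"

# Return the raw text of one "| field = value" line, or NA if it is absent.
infobox_field <- function(wikitext, field) {
  lines <- strsplit(wikitext, "\n", fixed = TRUE)[[1]]
  pattern <- paste0("^\\|[[:space:]]*", field, "[[:space:]]*=[[:space:]]*(.*)$")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pattern, "\\1", hit[1]))
}

budget <- infobox_field(wikitext, "budget")
gross  <- infobox_field(wikitext, "gross")

# Keep only digits and the decimal point; assumes every value uses the same unit.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

print(c(budget = to_number(budget), gross = to_number(gross)))
```

Real infoboxes are messier than this: values often carry currency symbols, ranges, footnote markup or mixed units, so the numeric conversion would need more care.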
I used R's glm function, which fits a generalized linear model, to model the data and predict the outcome. When I cross-checked the model on the test data, it gave 65-70% accuracy. Some general observations from the project:
Most common (frequent) words overall
Words whose use increases the chances of earning more
Most costly movies per word:
- Gundello Godari
- Krrish 3
- Seethamma Vakitlo Sirimalle Chettu
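For reference, the modelling step described earlier (fitting with `glm` and checking accuracy on held-out data) could look something like the sketch below. It makes several assumptions: the outcome is treated as a binary hit/flop label with `family = binomial`, the data is synthetic, and the column names (`budget`, `count_love`, `count_fight`, `hit`) are hypothetical stand-ins for the real word features.

```r
set.seed(42)
n <- 200

# Synthetic stand-in data; these columns are hypothetical, not the real features.
movies <- data.frame(
  budget      = runif(n, 5, 100),  # made-up budget values
  count_love  = rpois(n, 20),      # made-up count of one word
  count_fight = rpois(n, 10)       # made-up count of another word
)

# Made-up rule for generating a binary "hit" label, for illustration only.
p <- plogis(-2 + 0.03 * movies$budget + 0.02 * movies$count_love)
movies$hit <- rbinom(n, 1, p)

# Hold out 30% of the movies as test data.
train_idx <- sample(n, size = round(0.7 * n))
train <- movies[train_idx, ]
test  <- movies[-train_idx, ]

# Fit a logistic regression, one kind of generalized linear model.
fit <- glm(hit ~ budget + count_love + count_fight,
           data = train, family = binomial)

# Predict probabilities on the test set and threshold them at 0.5.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

accuracy <- mean(pred == test$hit)
print(accuracy)
```

If the target were the business amount itself rather than a hit/flop label, the same call with `family = gaussian` would fit an ordinary linear regression, and accuracy would have to be replaced by an error measure.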
I have stopped the project here for now, but it has a lot of scope for further exploration.