Part 1: Data

Movies are a compelling art form. But while nearly everyone enjoys a good movie, what exactly makes a movie popular? To begin to answer this question, 651 movies were drawn at random from the Rotten Tomatoes and IMDB websites through their respective APIs. Because the sample is random, this observational study allows us to make inferences about the general population of movies scored on Rotten Tomatoes, though not causal conclusions.

Bayesian regression will be used to build a model from the data, and the model will then be used to predict the audience score of a movie released in 2016.

Load packages & data

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(BAS)
load(paste0(getwd(),"/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata"))

Part 2: Data manipulation

The dataset consists of 32 variables. The analysis focused on only 12 of them, some of which were created with ‘dplyr.’ The new variables were converted to factors so that ggplot() would plot them correctly. One movie was found to have a missing runtime; the corresponding title was located in the dataset and the runtime was filled in from the Rotten Tomatoes website. The ‘audience score’ and ‘critics score’ variables were divided by 100 to put them on a 0–1 scale.

movies <- movies %>% 
        mutate(feature_film = ifelse(title_type=="Feature Film","yes","no")) %>% 
        mutate(drama = ifelse(genre=="Drama","yes","no")) %>% 
        mutate(mpaa_rating_R = ifelse(mpaa_rating=="R","yes","no")) %>% 
        ## Assuming that 1 = January, 2 = February, etc.
        mutate(oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12),
                                     "yes", "no")) %>% 
        mutate(summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8),
                                      "yes", "no")) %>% 
        mutate(aud100 = audience_score/100) %>% 
        mutate(critics_score100 = critics_score/100)

## Convert to factors those which are characters
movies$feature_film <- as.factor(movies$feature_film) 
movies$drama <- as.factor(movies$drama)
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R) 
movies$oscar_season <- as.factor(movies$oscar_season) 
movies$summer_season <- as.factor(movies$summer_season)

## This movie is missing its runtime, which Rotten Tomatoes lists as 71 minutes
movies[which(is.na(movies$runtime)), "runtime"] <- 71

Part 3: Exploratory data analysis

Exploratory data analysis was conducted on the newly created variables. Combining the ‘%>%’ operator with the ‘.’ placeholder allows us to bypass the creation of intermediate variables.

movies %>% 
        select(aud100, feature_film, drama, 
               mpaa_rating_R, oscar_season, summer_season) %>% 
        summary(.)
##      aud100       feature_film drama     mpaa_rating_R oscar_season
##  Min.   :0.1100   no : 60      no :346   no :322       no :530     
##  1st Qu.:0.4600   yes:591      yes:305   yes:329       yes:121     
##  Median :0.6500                                                    
##  Mean   :0.6236                                                    
##  3rd Qu.:0.8000                                                    
##  Max.   :0.9700                                                    
##  summer_season
##  no :487      
##  yes:164      
##               
##               
##               
## 

The output from summary() indicates that most movies are feature films, about half are dramas, about half are rated R, most were released outside of the usual “Oscar season,” and a fair number were also released outside of the summer season. The highest RT audience score is 97 (0.97) and the lowest is 11 (0.11), with a mean of 62.36 and a median of 65.

Let’s examine how the Rotten Tomato (RT) scores are distributed for each of these newly engineered explanatory variables.

movies %>% 
        select(aud100, feature_film, drama, 
               mpaa_rating_R, oscar_season, summer_season) %>% 
        gather("Group","Status",2:6) %>% 
        ggplot(.) + 
        geom_boxplot(aes(as.factor(Group),aud100)) + 
        theme_bw() + 
        facet_grid(Status ~.) + 
        xlab("Engineered Explanatory Variables") + 
        ylab("RT Audience Score") + 
        ggtitle("Boxplots of audience score by variable")

data-boxplot-1.png

The boxplots show that movies skew towards higher RT scores regardless of whether they have a particular feature or not. The interesting exception is movies which are not feature films: their distribution of RT scores sits noticeably higher than that of:

  • movies which are feature films
  • movies which have any of the other engineered features

Let’s look at the distribution of RT scores with respect to our “feature film” variable.

movies %>% 
        select(aud100, feature_film) %>%
        ggplot(.) +
        geom_density(aes(aud100, fill = feature_film, 
                         color = feature_film), alpha = 0.3) + 
        xlab("RT Audience Score") + 
        ylab("Density") + 
        ggtitle("Density of Scores by Feature Film Status")

data-density-1.png

Movies which are “feature films” have a more even distribution of RT scores while movies which are not “feature films” have a distribution which is skewed towards higher RT scores.

Finally, the probability of each “newly engineered” variable as a function of RT score is plotted. Logistic regression was used to estimate these probabilities.

var.list <- c("feature_film", "drama","mpaa_rating_R","oscar_season","summer_season")
pv <- data.frame(aud100 = movies$aud100)
for (i in var.list) {
       ## select_() is deprecated; select(all_of()) is the current idiom
       prob <- movies %>% 
                select(all_of(c("aud100", i))) %>% 
                glm(formula = paste0(i, " ~ aud100"), 
                    data = ., 
                    family = "binomial") %>% 
                predict(., 
                        newdata = movies, 
                        type = "response")
       
       pv <- cbind(pv, prob)
}

names(pv)[2:6] <- var.list

pv %>% 
        gather("Variable","Probability",2:6) %>% 
        ggplot(.) + 
        geom_line(aes(aud100, Probability, color = Variable)) + 
        theme_bw() + 
        xlab("RT Audience Score") + 
        ggtitle("Probability Movie Will Have Explanatory Variable")

data-logistic-1.png

It is interesting that there is a steep drop-off in the probability that a movie is a feature film as the RT score increases. The probability that a movie was released in the summer season decreases, while the probability that it was released in the Oscar season increases, with each additional point of RT score.

Part 4: Modeling

The following explanatory variables were used in modeling:

  • Whether it is a feature film, or not
  • Whether it is a drama, or not
  • The length of the movie
  • Whether it is rated R, or not
  • The year the movie was released
  • Whether the movie was released during the Oscar season, or not
  • Whether the movie was released during the summer season, or not
  • The number of votes the movie received on IMDB
  • The RT critics score
  • Whether the movie was nominated for best picture, or not
  • Whether the movie won best picture, or not
  • Whether the lead actor won best actor, or not
  • Whether the lead actress won best actress, or not
  • Whether the director won best director, or not
  • Whether the movie was in the top 200 in box office sales, or not

The modeling data was separated from the movies dataset, checked for the presence of NAs, and run using a Zellner–Siow null prior and the Bayesian Adaptive Sampling (BAS) method.

## Select explanatory variables
modeling.data <- movies[,c("aud100","feature_film","drama","runtime",
                           "mpaa_rating_R","thtr_rel_year","oscar_season",
                           "summer_season","imdb_rating","imdb_num_votes",
                           "critics_score100","best_pic_nom","best_pic_win",
                           "best_actor_win","best_actress_win","best_dir_win",
                           "top200_box")]

## Any movies missing data?  
apply(is.na(modeling.data), 2, sum)
##           aud100     feature_film            drama          runtime 
##                0                0                0                0 
##    mpaa_rating_R    thtr_rel_year     oscar_season    summer_season 
##                0                0                0                0 
##      imdb_rating   imdb_num_votes critics_score100     best_pic_nom 
##                0                0                0                0 
##     best_pic_win   best_actor_win best_actress_win     best_dir_win 
##                0                0                0                0 
##       top200_box 
##                0
## Save as data.frame
modeling.data <- as.data.frame(modeling.data)

## Model Selection
set.seed(100)
bas.mod.full <- bas.lm(aud100 ~ ., 
                    data = modeling.data,
                    na.action="na.omit",
                    prior = "ZS-null",
                    modelprior = beta.binomial(1,1),
                    initprobs = "eplogp",
                    method = "BAS",
                    prob.rw = 0.5)

It is concerning that the ‘imdb rating’ and ‘critics score’ variables are highly correlated with each other (not shown). To avoid an issue with collinearity, the above model was re-run with imdb_rating dropped.
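The correlation itself can be checked directly. This is a quick sketch, assuming the movies data frame built above is still in scope:

```r
## Pearson correlation between the IMDB rating and the rescaled critics score
cor(movies$imdb_rating, movies$critics_score100)
```

A value near 1 would confirm the collinearity concern and support dropping one of the two variables.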

## Model Selection
set.seed(100)
bas.mod.small <- bas.lm(aud100 ~ . -imdb_rating, 
                    data = modeling.data,
                    na.action="na.omit",
                    prior = "ZS-null",
                    modelprior = beta.binomial(1,1),
                    initprobs = "eplogp",
                    method = "BAS",
                    prob.rw = 0.5)

This improved the spread of the residuals, but also lowered the maximum log posterior odds. Still, it seems statistically dishonest to use one website’s rating system to predict another’s; it is much more useful to predict the Rotten Tomatoes audience score from the attributes of the movies themselves. Despite potentially foregoing a better-scoring model, the smaller model was chosen to generate predictions.
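The two diagnostic figures below can be reproduced with the plotting methods from the BAS package; this is a sketch assuming the bas.mod.small object fit above:

```r
## Model-space image: which predictors appear in the top-ranked models
image(bas.mod.small, rotate = FALSE)

## Residuals vs. fitted values under Bayesian Model Averaging
plot(bas.mod.small, which = 1)
```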

model-image-1.png

model-residuals-1.png

The coefficients are displayed below:
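The original call producing this table is not shown; the marginal posterior summaries come from the coef() method on the fitted bas object:

```r
## Posterior means, SDs, and inclusion probabilities for each coefficient
coef(bas.mod.small)
```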

## 
##  Marginal Posterior Summaries of Coefficients: 
##                      post mean   post SD     post p(B != 0)
## Intercept             6.236e-01   5.439e-03   1.000e+00    
## feature_filmyes      -9.160e-02   2.554e-02   9.875e-01    
## dramayes              1.423e-02   1.753e-02   4.517e-01    
## runtime               9.066e-06   7.721e-05   3.749e-02    
## mpaa_rating_Ryes     -6.807e-05   1.929e-03   2.816e-02    
## thtr_rel_year        -1.390e-04   3.960e-04   1.398e-01    
## oscar_seasonyes       3.684e-04   3.303e-03   3.546e-02    
## summer_seasonyes     -1.774e-04   2.417e-03   3.037e-02    
## imdb_num_votes        3.220e-07   5.340e-08   1.000e+00    
## critics_score100      4.385e-01   2.357e-02   1.000e+00    
## best_pic_nomyes       3.468e-03   1.529e-02   7.261e-02    
## best_pic_winyes      -1.628e-03   1.416e-02   3.648e-02    
## best_actor_winyes    -3.513e-04   3.469e-03   3.371e-02    
## best_actress_winyes  -1.028e-03   6.002e-03   5.060e-02    
## best_dir_winyes      -4.478e-04   4.779e-03   3.263e-02    
## top200_boxyes        -5.951e-04   7.498e-03   3.095e-02

Interpretation of the coefficients for the numerical variables is straightforward:

  • Each additional minute of runtime increases the RT score by 0.000009, or 0.0009 percentage points
  • Each additional IMDB vote increases the RT score by 0.00000032, or 0.000032 percentage points
  • Each additional critics-score percentage point increases the RT score by 0.44 percentage points
  • Each additional release year decreases the RT score by 0.00014, or 0.014 percentage points (I guess movies improve with age!)

The coefficients for categorical variables are also straightforward, but in a different way. If a movie has an attribute (denoted by “yes” for the categorical variable), that coefficient functions as a constant added to the intercept. In effect, it is an additional intercept term that is present when the category is “yes” and absent when it is “no.”
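As a back-of-the-envelope sketch using the posterior means from the table above: a feature-film drama picks up both “yes” shifts on top of the intercept. (This is illustrative only; bas.lm centers predictors, and all other terms are omitted here.)

```r
## Intercept plus the two "yes" shifts, posterior means from the table above
intercept        <- 0.6236    # posterior mean of the intercept
feature_film_yes <- -0.0916   # shift when feature_film == "yes"
drama_yes        <- 0.0142    # shift when drama == "yes"

intercept + feature_film_yes + drama_yes
## 0.5462
```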

Part 5: Prediction

The movie ‘Arrival’ was chosen as the new data point for which to predict the Rotten Tomatoes audience score. The data was looked up manually and compiled into a data frame. The “BMA” estimator was used for prediction to include as many predictors as possible, in the hope that the ensemble of weaker predictors would help. It was assumed that “Arrival” would not be nominated for best picture, so this variable was set to “no”.
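The construction step is not shown above; a sketch of how the arrival.movie data frame could be assembled, using the values listed in the output below:

```r
## Manually compiled data for "Arrival" (values looked up by hand)
arrival.movie <- data.frame(
        aud100 = 0.83, feature_film = "yes", drama = "yes",
        runtime = 116, mpaa_rating_R = "no", thtr_rel_year = 2016,
        oscar_season = "yes", summer_season = "no",
        imdb_rating = 8.3, imdb_num_votes = 93676,
        critics_score100 = 0.94, best_pic_nom = "no", best_pic_win = "no",
        best_actor_win = "no", best_actress_win = "no", best_dir_win = "no",
        top200_box = "no")
```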

arrival.movie
##    aud100 feature_film drama runtime mpaa_rating_R thtr_rel_year
## 12   0.83          yes   yes     116            no          2016
##    oscar_season summer_season imdb_rating imdb_num_votes critics_score100
## 12          yes            no         8.3          93676             0.94
##    best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## 12           no           no             no               no           no
##    top200_box
## 12         no
yhat_bma = predict(bas.mod.small, newdata=arrival.movie, estimator="BMA")
yhat_bma$fit
##           [,1]
## [1,] 0.7916974

The prediction is 79.2%. This is reasonably close to the actual score of 83%.
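A sense of the uncertainty around this point estimate can be obtained from the prediction’s credible interval; a sketch, assuming the yhat_bma prediction object above:

```r
## 95% credible interval for the predicted audience score
confint(yhat_bma, parm = "pred")
```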

Part 6: Conclusion

Bayesian regression performs reasonably well at predicting the RT audience score for a single movie outside of its sample data. The model could likely be improved by adjusting the options for bas.lm() and by making better choices about which variables from the movies dataset to use; a variable for “lead actor nominated for an Oscar” would probably help. The predictive power might also have been greater if the imdb_rating variable had been left in with the rest of the modeling data. This work suggests that feature-film dramas with a large number of votes and a high critics score do better than most other movies. Thank you for reading!
