Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

Movies Data

Over the next few classes we will be working with a dataset of movies I have constructed consisting of the top 100 grossing films for each year from 1970 to 2019. The data comes from IMDb. Today we will focus on getting familiar with the various components of the data. Let’s read in the four tables of data, as well as a data dictionary, and then go through each of the tables.

movies <- read_csv(file.path("data", "movies_50_years.csv"))
m_genre <- read_csv(file.path("data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("data", "movies_50_years_people.csv"))
m_dict <- read_csv(file.path("data", "movies_50_years_data_dictionary.csv"))
m_color <- read_csv(file.path("data", "movies_50_years_color.csv"))

The movies dataset contains one row for each movie. Most of the variables are fairly straightforward; you can see the units by opening the data dictionary. Three variables concern the movie’s theatrical poster. These give the average brightness (average pixel intensity), saturation (whether the colors are bright or washed out / white), and complexity (stored as poster_edgeness; a cartoon image would have a low complexity, while lots of text or icons would have a high complexity).

movies
## # A tibble: 5,000 x 12
##     year title mpa   runtime gross rating_count rating metacritic
##    <dbl> <chr> <chr>   <dbl> <dbl>        <dbl>  <dbl>      <dbl>
##  1  1970 Love… PG        100 106.         28330    6.9         NA
##  2  1970 Airp… G         137 100.         16512    6.6         42
##  3  1970 MASH  R         116  81.6        64989    7.5         NA
##  4  1970 Patt… GP        172  61.7        90461    7.9         NA
##  5  1970 The … G          78  37.7        87551    7.1         NA
##  6  1970 Litt… PG-13     139  31.6        31412    7.6         NA
##  7  1970 Tora… G         144  29.6        30347    7.5         46
##  8  1970 Catc… R         122  24.9        20997    7.2         NA
##  9  1970 The … PG         95  23.7         3107    6.5         NA
## 10  1970 Joe   R         107  19.3         2633    6.8         NA
## # … with 4,990 more rows, and 4 more variables: poster_brightness <dbl>,
## #   poster_saturation <dbl>, poster_edgeness <dbl>, description <chr>

A second dataset gives more detailed information about each poster by indicating how much of the poster is of a certain color. If you want to look at the movie poster itself, just search for the film on IMDb. The poster is the first image on the film’s page.

m_color
## # A tibble: 46,980 x 5
##     year title      color_type color  percentage
##    <dbl> <chr>      <chr>      <chr>       <dbl>
##  1  1970 Love Story hue        red      2.64    
##  2  1970 Love Story hue        orange   3.09    
##  3  1970 Love Story hue        yellow   0.0542  
##  4  1970 Love Story hue        green    0.236   
##  5  1970 Love Story hue        blue     0.319   
##  6  1970 Love Story hue        violet   0.000534
##  7  1970 Love Story greyscale  black   11.0     
##  8  1970 Love Story greyscale  grey     6.91    
##  9  1970 Love Story greyscale  white   75.8     
## 10  1970 Love Story hue        other    0       
## # … with 46,970 more rows

We also have a dataset of movie genres. The data structure is straightforward, but it needs to be kept in its own table because a single movie can be assigned to multiple genres.

m_genre
## # A tibble: 11,887 x 3
##     year title      genre    
##    <dbl> <chr>      <chr>    
##  1  1970 Love Story Drama    
##  2  1970 Love Story Romance  
##  3  1970 Airport    Action   
##  4  1970 Airport    Drama    
##  5  1970 Airport    Thriller 
##  6  1970 MASH       Comedy   
##  7  1970 MASH       Drama    
##  8  1970 MASH       War      
##  9  1970 Patton     Biography
## 10  1970 Patton     Drama    
## # … with 11,877 more rows

Finally, we also have a dataset of people associated with each film. We do not have a lot of metadata about the people, but I have added a prediction of each person’s gender based on U.S. Social Security records. These are not always correct (there is a confidence score included as well) but are useful for some aggregate analyses.

m_people
## # A tibble: 24,648 x 7
##     year title      role      rank person         gender gender_conf
##    <dbl> <chr>      <chr>    <dbl> <chr>          <chr>        <dbl>
##  1  1970 Love Story director     1 Arthur Hiller  male         0.994
##  2  1970 Love Story starring     1 Ali MacGraw    male         0.688
##  3  1970 Love Story starring     2 Ryan O'Neal    male         0.977
##  4  1970 Love Story starring     3 John Marley    male         0.996
##  5  1970 Love Story starring     4 Ray Milland    male         0.984
##  6  1970 Airport    director     1 George Seaton  male         0.993
##  7  1970 Airport    director     2 Henry Hathaway male         0.994
##  8  1970 Airport    starring     1 Burt Lancaster male         1    
##  9  1970 Airport    starring     2 Dean Martin    male         0.988
## 10  1970 Airport    starring     3 George Kennedy male         0.993
## # … with 24,638 more rows
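
Because the gender predictions come with a confidence score, one cautious option is to keep only the high-confidence rows before aggregating. The following is a minimal sketch using plain dplyr on a few rows typed in from the output above; the object names and the 0.9 cutoff are assumptions chosen for illustration, not values from the course materials.

```r
library(dplyr)

# Toy rows copied from the printed m_people output above
people_toy <- tibble(
  title = c("Love Story", "Love Story", "Love Story", "Airport"),
  person = c("Arthur Hiller", "Ali MacGraw", "Ryan O'Neal", "Burt Lancaster"),
  gender = c("male", "male", "male", "male"),
  gender_conf = c(0.994, 0.688, 0.977, 1)
)

conf_cutoff <- 0.9  # hypothetical threshold, pick one that suits your analysis

# Keep only the predictions at or above the cutoff
confident <- people_toy %>%
  filter(gender_conf >= conf_cutoff)

# Aggregate on the filtered rows only
gender_counts <- confident %>%
  count(gender)

print(gender_counts)
```

Here the low-confidence prediction for Ali MacGraw (0.688) is dropped before counting, which is the kind of row a threshold is meant to catch.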

There is a lot to unpack with these datasets. We will need to make use of the many methods we have learned so far this semester to make sense of the data.

Movie Genre

Start by making a bar plot showing the number of times each genre tag appears in the dataset. No need for fancy labels or titles, but do order the categories from smallest to largest and consider flipping the axes if your screen is too narrow to read the vertical-bar version of the plot.

m_genre %>%
  group_by(genre) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  mutate(genre = fct_inorder(genre)) %>%
  ggplot(aes(genre, count)) +
    geom_col() +
    coord_flip()
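
For reference, the counting and ordering step can also be written with standard dplyr verbs. This is a sketch on a toy table built from the first rows of m_genre shown earlier; it assumes that sm_count() behaves like a grouped row count stored in a column named count, which matches how the code above uses it.

```r
library(dplyr)

# Toy rows copied from the printed m_genre output
genre_toy <- tibble(
  title = c("Love Story", "Love Story", "Airport", "Airport", "Airport",
            "MASH", "MASH", "MASH"),
  genre = c("Drama", "Romance", "Action", "Drama", "Thriller",
            "Comedy", "Drama", "War")
)

genre_counts <- genre_toy %>%
  count(genre, name = "count") %>%                # one row per genre
  arrange(desc(count), genre) %>%                 # most common first, ties alphabetical
  mutate(genre = factor(genre, levels = genre))   # freeze the row order as factor levels

print(genre_counts)
```

Turning genre into a factor whose levels follow the sorted rows is what makes the plot respect the ordering; a plain character column would be drawn alphabetically.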

Now, we are going to do something a bit more complicated. In the chunk below, create a visualization that shows which genre tag is used the most in each year. The plot is easy; creating the dataset will take some work.

m_genre %>%
  group_by(year, genre) %>%
  summarize(sm_count()) %>%
  group_by(year) %>%
  arrange(desc(count)) %>%
  slice(1) %>%
  ggplot(aes(year, genre)) +
    geom_point(aes(size = count))
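
The "largest row per group" step can also be expressed with slice_max from dplyr (version 1.0.0 or later). This is a sketch on toy data: the 1970 rows come from the output above, while the 1971 films ("Film A", "Film B") are made-up placeholders.

```r
library(dplyr)

genre_toy <- tibble(
  year = c(1970, 1970, 1970, 1970, 1971, 1971, 1971),
  title = c("Love Story", "Love Story", "Airport", "Airport",
            "Film A", "Film B", "Film B"),   # 1971 titles are hypothetical
  genre = c("Drama", "Romance", "Action", "Drama",
            "Comedy", "Comedy", "Crime")
)

top_genre <- genre_toy %>%
  count(year, genre, name = "count") %>%          # tags per year and genre
  group_by(year) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%  # keep one top genre per year
  ungroup()

print(top_genre)
```

Setting with_ties = FALSE guarantees exactly one row per year; if two genres tie, only the first is kept, so leave the default with_ties = TRUE if you would rather see every tied genre.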

Finally, we are going to create a dataset that has one row for each film that can be merged into the main movies dataset. First, create a dataset that collapses all of the genres for a film into a single value using the function sm_paste.

m_genre %>%
  group_by(year, title) %>%
  summarize(sm_paste(genre))
## # A tibble: 4,843 x 3
## # Groups:   year [50]
##     year title                          genre_paste              
##    <dbl> <chr>                          <chr>                    
##  1  1970 ...tick... tick... tick...     Drama; Action            
##  2  1970 A Man Called Horse             Adventure; Drama; Western
##  3  1970 A Voyage to Arcturus           Fantasy                  
##  4  1970 Airport                        Action; Drama; Thriller  
##  5  1970 Alex in Wonderland             Comedy; Drama            
##  6  1970 Angel Unchained                Action; Thriller; Drama  
##  7  1970 Barquero                       Western                  
##  8  1970 Beneath the Planet of the Apes Action; Adventure; Sci-Fi
##  9  1970 Beyond the Valley of the Dolls Comedy; Drama; Music     
## 10  1970 Billy Boy                      Drama; Romance           
## # … with 4,833 more rows
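
To illustrate the merge step mentioned above, here is a sketch that collapses the genres with base R's paste(collapse = ) and then joins the result onto a toy movies table. The toy tables and object names are assumptions; the separator "; " is chosen to match the output shown above.

```r
library(dplyr)

# Toy rows copied from the printed outputs above
genre_toy <- tibble(
  year = c(1970, 1970, 1970, 1970, 1970),
  title = c("Love Story", "Love Story", "MASH", "MASH", "MASH"),
  genre = c("Drama", "Romance", "Comedy", "Drama", "War")
)

movies_toy <- tibble(
  year = c(1970, 1970),
  title = c("Love Story", "MASH"),
  runtime = c(100, 116)
)

# One row per film, with all genres in a single string
genre_one_row <- genre_toy %>%
  group_by(year, title) %>%
  summarize(genre_paste = paste(genre, collapse = "; "), .groups = "drop")

# Attach the collapsed genres to the film-level table
movies_with_genre <- movies_toy %>%
  left_join(genre_one_row, by = c("year", "title"))

print(movies_with_genre)
```

A left join keeps every row of the movies table, so a film with no matching genre rows would simply get NA in genre_paste rather than being dropped.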

Then, create a dataset that associates each film with the “least popular” genre assigned to it. For example, MASH is a comedy, drama, and war film. You should have seen above that the “war” tag is much less common than “comedy” or “drama”, so MASH should be associated with “war”.

m_genre %>%
  group_by(genre) %>%
  mutate(sm_count()) %>%
  arrange(count) %>%
  group_by(year, title) %>%
  slice(1) %>%
  ungroup() %>%   # not needed here, but good practice to ungroup when done
  select(-count)  # also probably do not need `count` anymore, so remove it
## # A tibble: 4,843 x 3
##     year title                          genre   
##    <dbl> <chr>                          <chr>   
##  1  1970 ...tick... tick... tick...     Action  
##  2  1970 A Man Called Horse             Western 
##  3  1970 A Voyage to Arcturus           Fantasy 
##  4  1970 Airport                        Thriller
##  5  1970 Alex in Wonderland             Comedy  
##  6  1970 Angel Unchained                Thriller
##  7  1970 Barquero                       Western 
##  8  1970 Beneath the Planet of the Apes Sci-Fi  
##  9  1970 Beyond the Valley of the Dolls Music   
## 10  1970 Billy Boy                      Romance 
## # … with 4,833 more rows

Assigning the least-popular genre typically gives the most appropriate tag to each movie because it will usually also be the most specific genre.
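
The same idea can be sketched with add_count and slice_min from dplyr. The toy table below reuses films from the output above plus one made-up film ("Film A") so that "Comedy" is more common than "War", as it is in the full data.

```r
library(dplyr)

genre_toy <- tibble(
  year = 1970,
  title = c("Love Story", "Love Story", "MASH", "MASH", "MASH",
            "Patton", "Patton", "Film A", "Film A"),  # "Film A" is hypothetical
  genre = c("Drama", "Romance", "Comedy", "Drama", "War",
            "Biography", "Drama", "Comedy", "Drama")
)

rarest_genre <- genre_toy %>%
  add_count(genre, name = "count") %>%            # overall frequency of each tag
  group_by(year, title) %>%
  slice_min(count, n = 1, with_ties = FALSE) %>%  # rarest tag within each film
  ungroup() %>%
  select(-count)

print(rarest_genre)
```

In this toy data MASH is assigned "War" and Patton is assigned "Biography", mirroring the reasoning above. When two of a film's tags are equally rare, with_ties = FALSE keeps whichever appears first.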

Movie Color

Start by verifying that the percentage values for each film add up to 100 (there may be some slight rounding error, but every total should be very close to 100). Use whatever method you find to be the easiest or most reliable, but do not resort to manually checking the values for each film.

m_color %>%
  group_by(year, title) %>%
  summarize(sm_sum(percentage)) %>%
  ungroup() %>%
  summarize(sm_min(percentage_sum), sm_max(percentage_sum))
## # A tibble: 1 x 2
##   percentage_sum_min percentage_sum_max
##                <dbl>              <dbl>
## 1               100.               100.
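
An alternative check is to list any films whose total falls outside a tolerance, which tells you which films are off rather than only the extremes. This sketch uses the ten Love Story rows printed earlier; the tolerance of 0.1 is an assumption, set loosely because the printed values are themselves rounded.

```r
library(dplyr)

# The ten Love Story rows from the printed m_color output
color_toy <- tibble(
  year = 1970,
  title = "Love Story",
  color = c("red", "orange", "yellow", "green", "blue",
            "violet", "black", "grey", "white", "other"),
  percentage = c(2.64, 3.09, 0.0542, 0.236, 0.319,
                 0.000534, 11.0, 6.91, 75.8, 0)
)

tolerance <- 0.1  # hypothetical allowance for rounding error

# Films whose percentages do not sum to roughly 100
off_films <- color_toy %>%
  group_by(year, title) %>%
  summarize(total = sum(percentage), .groups = "drop") %>%
  filter(abs(total - 100) > tolerance)

print(off_films)
```

An empty result means every film passed the check; any rows that do appear are the films to inspect.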

Next, associate each film with the color that is most dominant in its poster. Plot a count of the most dominant colors using a bar plot. No need for any labels, titles, or other finishing touches, but do try to color the bars according to the associated color name. Note that you will have to use the aesthetic “fill” and the scale scale_fill_identity to do this.

m_color %>%
  group_by(year, title) %>%
  arrange(desc(percentage)) %>%
  slice(1) %>%
  group_by(color) %>%
  summarize(sm_count()) %>%
  ggplot(aes(color, count)) +
    geom_col(aes(fill = color), color = "black") +
    scale_fill_identity()

Redo the same below, but now include the two most dominant colors. Note any patterns that you see in relation to the first plot. Note: this should be an easy tweak of your last plot.

m_color %>%
  group_by(year, title) %>%
  arrange(desc(percentage)) %>%
  slice(1:2) %>%
  group_by(color) %>%
  summarize(sm_count()) %>%
  ggplot(aes(color, count)) +
    geom_col(aes(fill = color), color = "black") +
    scale_fill_identity()
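
The data step behind this plot can also be written with slice_max, where the number of colors kept per film becomes a single argument. This is a sketch on toy data: the Love Story rows come from the output above, and "Film A" with its percentages is made up for illustration.

```r
library(dplyr)

color_toy <- tibble(
  year = 1970,
  title = c(rep("Love Story", 4), rep("Film A", 3)),  # "Film A" is hypothetical
  color = c("red", "black", "grey", "white", "blue", "black", "white"),
  percentage = c(2.64, 11.0, 6.91, 75.8, 60, 30, 10)
)

n_colors <- 2  # set to 1 to reproduce the single most dominant color

top_colors <- color_toy %>%
  group_by(year, title) %>%
  slice_max(percentage, n = n_colors, with_ties = FALSE) %>%
  ungroup()

color_counts <- top_colors %>%
  count(color, name = "count")

print(color_counts)
```

One thing to watch for when passing these values to scale_fill_identity: the color column includes the label "other", which is not a valid R color name, so it would need to be recoded or filtered out if it ever appears among the dominant colors.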