Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

Movies Data

Over the next few classes we will be working with a dataset of movies that I constructed, consisting of the 100 top-grossing films for each year from 1970 to 2019. The data comes from IMDb. Today we will focus on getting familiar with the various components of the data. Let’s read in the four tables of data, as well as a data dictionary, and then go through each of the tables.

movies <- read_csv(file.path("data", "movies_50_years.csv"))
m_genre <- read_csv(file.path("data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("data", "movies_50_years_people.csv"))
m_dict <- read_csv(file.path("data", "movies_50_years_data_dictionary.csv"))
m_color <- read_csv(file.path("data", "movies_50_years_color.csv"))

See notebook11 and the data dictionary for more information about the available variables.

Movie People

Summarize the average number of people listed as starring in a film for each year (first count the number per film, then average those counts within each year), and plot the pattern over the 50 years of data available. Do you notice anything strange about the dataset?

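m_people %>%
  filter(role == "starring") %>%
  group_by(year, title) %>%
  summarize(sm_count()) %>%      # number of starring people per film (column `count`)
  summarize(sm_mean(count)) %>%  # average per year; the data is still grouped by year here
  ggplot(aes(year, count_mean)) +
    geom_point() +
    geom_line()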
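
The two-step summary can also be written with plain dplyr verbs. The sketch below assumes that `sm_count()` and `sm_mean()` behave like `n()` and `mean()`, as the column names `count` and `count_mean` in the code above suggest. The `toy_people` table is a hypothetical stand-in, not the real `m_people` data.

```r
library(dplyr)

# Hypothetical toy data standing in for m_people (not the real dataset)
toy_people <- tibble(
  year   = c(1970, 1970, 1970, 1970, 1970, 1971, 1971),
  title  = c("A", "A", "A", "B", "B", "C", "C"),
  role   = "starring",
  person = c("p1", "p2", "p3", "p4", "p5", "p6", "p7")
)

stars_per_year <- toy_people %>%
  filter(role == "starring") %>%
  group_by(year, title) %>%
  # Step 1: one row per film; drop only the `title` grouping, keep `year`
  summarize(count = n(), .groups = "drop_last") %>%
  # Step 2: one row per year, averaging the per-film counts
  summarize(count_mean = mean(count))

stars_per_year
```

Making `.groups = "drop_last"` explicit shows why the second `summarize()` averages within each year: only the last grouping variable is removed after the first step.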

Filter the data to those names where the gender confidence score is less than 0.6. Note any patterns that you see, and think about the caveats that any gender-based analysis of the larger dataset should take into account.

m_people %>%
  filter(gender_conf < 0.6)
## # A tibble: 206 x 7
##     year title                    role     rank person     gender gender_conf
##    <dbl> <chr>                    <chr>   <dbl> <chr>      <chr>        <dbl>
##  1  1970 Le casse de l'oncle Tom  direct…     1 Ossie Dav… female       0.521
##  2  1970 The Hawaiians            starri…     4 Mako       female       0.514
##  3  1970 How Do I Love Thee?      starri…     1 Jackie Gl… female       0.554
##  4  1970 Multiple Maniacs         starri…     1 Divine     female       0.569
##  5  1970 Scent of Love            starri…     4 Casey Lar… male         0.575
##  6  1971 Nympho Cycler            starri…     1 Casey Lar… male         0.575
##  7  1971 The Secretary            starri…     3 Jeryl Hen… female       0.593
##  8  1972 Pink Flamingos           starri…     1 Divine     female       0.569
##  9  1972 The Gospel According to… starri…     1 Blair Zyk… male         0.503
## 10  1972 The Young Rounders       direct…     1 Casey Tib… male         0.575
## # … with 196 more rows
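
Several names in the output above appear more than once, so it may be easier to spot patterns after collapsing the rows to one per person. The following is a minimal sketch of that idea using plain dplyr. The `toy_people` table and its names are hypothetical, not rows from the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for m_people (not the real dataset)
toy_people <- tibble(
  person      = c("Person A", "Person A", "Person B", "Person C", "Person D"),
  gender      = c("female", "female", "male", "male", "female"),
  gender_conf = c(0.55, 0.55, 0.58, 0.99, 0.97)
)

low_conf <- toy_people %>%
  filter(gender_conf < 0.6) %>%
  # Keep one row per person so repeated appearances do not dominate the listing
  distinct(person, gender, gender_conf) %>%
  # Show the least certain assignments first
  arrange(gender_conf)

low_conf
```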

Now, make a plot showing the number of films starring each of the 20 most prolific actors, with the bars filled according to the actor’s gender. Hint: group the data by both gender and person before summarizing.

m_people %>%
  filter(role == "starring") %>%
  group_by(gender, person) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  slice(1:20) %>%
  ggplot(aes(person, count)) +
    geom_col(aes(fill = gender)) +
    coord_flip()

You will (hopefully) notice something strange in the plot above. Fix this by only including actors with a high gender confidence score (above 0.95 perhaps?).

m_people %>%
  filter(role == "starring") %>%
  filter(gender_conf > 0.95) %>%
  group_by(gender, person) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  slice(1:20) %>%
  ggplot(aes(person, count)) +
    geom_col(aes(fill = gender)) +
    coord_flip()
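
One possible variation, sketched here with plain dplyr and ggplot2 rather than the `sm_` helpers, is to count with `count()` and to order the bars by the number of films using `reorder()`. The `toy_people` table is hypothetical, and the cutoff of two rows stands in for the 20 used on the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for m_people (not the real dataset)
toy_people <- tibble(
  person      = c("Actor A", "Actor A", "Actor A", "Actor B", "Actor B", "Actor C"),
  gender      = c("female", "female", "female", "male", "male", "male"),
  role        = "starring",
  gender_conf = c(0.99, 0.99, 0.99, 0.98, 0.98, 0.97)
)

top_actors <- toy_people %>%
  filter(role == "starring", gender_conf > 0.95) %>%
  # One row per (gender, person) with the film count, largest first
  count(gender, person, name = "count", sort = TRUE) %>%
  slice_head(n = 2)  # use n = 20 on the real data

# reorder() sorts the bars by count instead of alphabetically
p <- ggplot(top_actors, aes(reorder(person, count), count)) +
  geom_col(aes(fill = gender)) +
  coord_flip() +
  labs(x = "person")

top_actors
```

Ordering the bars by count makes the ranking readable at a glance, which the default alphabetical ordering of `person` does not.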