Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
Over the next few classes we will be working with a dataset of movies that I constructed, consisting of the 100 top-grossing films for each year from 1970 to 2019. The data comes from IMDb. Today we will focus on getting familiar with the various components of the data. Let’s read in the four tables of data, as well as a data dictionary, and then go through each of the tables.
movies <- read_csv(file.path("data", "movies_50_years.csv"))
m_genre <- read_csv(file.path("data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("data", "movies_50_years_people.csv"))
m_dict <- read_csv(file.path("data", "movies_50_years_data_dictionary.csv"))
m_color <- read_csv(file.path("data", "movies_50_years_color.csv"))
See notebook11 and the data dictionary for more information about the available variables.
Summarize the average number of people listed as starring in a film for each year (first count the number per film and then take the average), and plot the pattern over the 50 years of data that we have available to us. Do you notice anything strange about the dataset?
m_people %>%
  filter(role == "starring") %>%
  group_by(year, title) %>%
  summarize(sm_count()) %>%
  summarize(sm_mean(count)) %>%
  ggplot(aes(year, count_mean)) +
    geom_point() +
    geom_line()
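The two-step summary (count per film, then average per year) can also be sketched with plain dplyr verbs on a small toy table. This is only an illustration under assumptions: it assumes `sm_count()` behaves like `count = n()` and `sm_mean(count)` like `count_mean = mean(count)`, and the table `toy_people`, its rows, and the role label `"other_role"` are made up rather than taken from the real data.

```r
library(dplyr)

# Toy stand-in for m_people (hypothetical rows, not the real data)
toy_people <- tibble(
  year   = c(1970, 1970, 1970, 1971, 1971, 1971, 1971, 1971),
  title  = c("A", "A", "B", "C", "C", "C", "C", "D"),
  role   = c("starring", "starring", "starring", "starring",
             "starring", "starring", "other_role", "starring"),
  person = c("p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8")
)

per_year <- toy_people %>%
  filter(role == "starring") %>%
  group_by(year, title) %>%
  # Step 1: one row per film with its number of stars;
  # "drop_last" keeps the result grouped by year only
  summarize(count = n(), .groups = "drop_last") %>%
  # Step 2: average the per-film counts within each year
  summarize(count_mean = mean(count))

per_year
```

The first `summarize()` removes only the innermost grouping variable (`title`), which is why the second `summarize()` averages within each year rather than over the whole table.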
Filter the data to those names where the gender confidence score is less than 0.6. Note any patterns that you see, and think about the caveats that any gender-based analysis of the larger dataset should take into account.
m_people %>%
  filter(gender_conf < 0.6)
## # A tibble: 206 x 7
##     year title                    role     rank person     gender gender_conf
##    <dbl> <chr>                    <chr>   <dbl> <chr>      <chr>        <dbl>
##  1  1970 Le casse de l'oncle Tom  direct…     1 Ossie Dav… female       0.521
##  2  1970 The Hawaiians            starri…     4 Mako       female       0.514
##  3  1970 How Do I Love Thee?      starri…     1 Jackie Gl… female       0.554
##  4  1970 Multiple Maniacs         starri…     1 Divine     female       0.569
##  5  1970 Scent of Love            starri…     4 Casey Lar… male         0.575
##  6  1971 Nympho Cycler            starri…     1 Casey Lar… male         0.575
##  7  1971 The Secretary            starri…     3 Jeryl Hen… female       0.593
##  8  1972 Pink Flamingos           starri…     1 Divine     female       0.569
##  9  1972 The Gospel According to… starri…     1 Blair Zyk… male         0.503
## 10  1972 The Young Rounders       direct…     1 Casey Tib… male         0.575
## # … with 196 more rows
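Because the same person can appear in several films, the filtered rows repeat some names (Divine appears twice in the first ten rows of the output above). One possible way to look for patterns is to collapse the result to one row per person. The sketch below uses a toy table built from three rows of that output plus one made-up high-confidence row; `toy_people` and `low_conf_people` are illustrative names, not objects from the notebook.

```r
library(dplyr)

# Three rows copied from the output above, plus one hypothetical
# high-confidence row that the filter should drop
toy_people <- tibble(
  year        = c(1970, 1970, 1972, 1972),
  title       = c("The Hawaiians", "Multiple Maniacs",
                  "Pink Flamingos", "Hypothetical Film"),
  person      = c("Mako", "Divine", "Divine", "Hypothetical Person"),
  gender      = c("female", "female", "female", "male"),
  gender_conf = c(0.514, 0.569, 0.569, 0.990)
)

low_conf_people <- toy_people %>%
  filter(gender_conf < 0.6) %>%
  # Keep one row per person instead of one row per film credit
  distinct(person, gender, gender_conf) %>%
  arrange(gender_conf)

low_conf_people
```

Applied to `m_people`, the same pattern would give the number of distinct low-confidence names rather than the number of film credits.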
Now, make a plot showing the number of films starring each of the 20 most prolific actors, with the bars filled according to the actor’s gender. Hint: consider grouping the data by both gender and person before summarizing.
m_people %>%
  filter(role == "starring") %>%
  group_by(gender, person) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  slice(1:20) %>%
  ggplot(aes(person, count)) +
    geom_col(aes(fill = gender)) +
    coord_flip()
You will (hopefully) notice something strange in the plot above. Fix this by only including actors with a high gender confidence score (above 0.95 perhaps?).
m_people %>%
  filter(role == "starring") %>%
  filter(gender_conf > 0.95) %>%
  group_by(gender, person) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  slice(1:20) %>%
  ggplot(aes(person, count)) +
    geom_col(aes(fill = gender)) +
    coord_flip()