Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.
After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.
Over the next few classes we will be working with a dataset of movies I have constructed consisting of the top 100 grossing films for each year from 1970 to 2019. The data comes from IMDb. Today we will focus on getting familiar with the various components of the data. Let’s read in the four tables of data, as well as a data dictionary, and then go through each of the tables.
movies <- read_csv(file.path("data", "movies_50_years.csv"))
m_genre <- read_csv(file.path("data", "movies_50_years_genre.csv"))
m_people <- read_csv(file.path("data", "movies_50_years_people.csv"))
m_dict <- read_csv(file.path("data", "movies_50_years_data_dictionary.csv"))
m_color <- read_csv(file.path("data", "movies_50_years_color.csv"))
The movies dataset contains one row for each movie. Most of the variables are fairly straightforward; you can see the units by opening the data dictionary. Three variables concern the movie’s theatrical poster. These give the average brightness (average pixel intensity), saturation (whether the colors are bright or washed out / white), and complexity (a cartoon image would have a low complexity; lots of text or icons would have a high complexity).
movies
## # A tibble: 5,000 x 12
## year title mpa runtime gross rating_count rating metacritic
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1970 Love… PG 100 106. 28330 6.9 NA
## 2 1970 Airp… G 137 100. 16512 6.6 42
## 3 1970 MASH R 116 81.6 64989 7.5 NA
## 4 1970 Patt… GP 172 61.7 90461 7.9 NA
## 5 1970 The … G 78 37.7 87551 7.1 NA
## 6 1970 Litt… PG-13 139 31.6 31412 7.6 NA
## 7 1970 Tora… G 144 29.6 30347 7.5 46
## 8 1970 Catc… R 122 24.9 20997 7.2 NA
## 9 1970 The … PG 95 23.7 3107 6.5 NA
## 10 1970 Joe R 107 19.3 2633 6.8 NA
## # … with 4,990 more rows, and 4 more variables: poster_brightness <dbl>,
## # poster_saturation <dbl>, poster_edgeness <dbl>, description <chr>
A second dataset gives more detailed information about each poster by indicating how much of a poster is of a certain color. If you want to look at the movie poster itself, just search for the film on IMDb. The poster is the first image on the film’s page.
m_color
## # A tibble: 46,980 x 5
## year title color_type color percentage
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1970 Love Story hue red 2.64
## 2 1970 Love Story hue orange 3.09
## 3 1970 Love Story hue yellow 0.0542
## 4 1970 Love Story hue green 0.236
## 5 1970 Love Story hue blue 0.319
## 6 1970 Love Story hue violet 0.000534
## 7 1970 Love Story greyscale black 11.0
## 8 1970 Love Story greyscale grey 6.91
## 9 1970 Love Story greyscale white 75.8
## 10 1970 Love Story hue other 0
## # … with 46,970 more rows
We also have a dataset of movie genres. The data structure is straightforward, but it needs to be kept in its own table because a single movie can be assigned to multiple genres.
m_genre
## # A tibble: 11,887 x 3
## year title genre
## <dbl> <chr> <chr>
## 1 1970 Love Story Drama
## 2 1970 Love Story Romance
## 3 1970 Airport Action
## 4 1970 Airport Drama
## 5 1970 Airport Thriller
## 6 1970 MASH Comedy
## 7 1970 MASH Drama
## 8 1970 MASH War
## 9 1970 Patton Biography
## 10 1970 Patton Drama
## # … with 11,877 more rows
Finally, we also have a dataset of people associated with each film. We do not have a lot of metadata about the people, but I have added a prediction of each person’s gender based on U.S. Social Security records. These are not always correct (there is a confidence score included as well) but are useful for some aggregate analyses.
m_people
## # A tibble: 24,648 x 7
## year title role rank person gender gender_conf
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1970 Love Story director 1 Arthur Hiller male 0.994
## 2 1970 Love Story starring 1 Ali MacGraw male 0.688
## 3 1970 Love Story starring 2 Ryan O'Neal male 0.977
## 4 1970 Love Story starring 3 John Marley male 0.996
## 5 1970 Love Story starring 4 Ray Milland male 0.984
## 6 1970 Airport director 1 George Seaton male 0.993
## 7 1970 Airport director 2 Henry Hathaway male 0.994
## 8 1970 Airport starring 1 Burt Lancaster male 1
## 9 1970 Airport starring 2 Dean Martin male 0.988
## 10 1970 Airport starring 3 George Kennedy male 0.993
## # … with 24,638 more rows
There is a lot to unpack with these datasets. We will need to make use of the many methods we have learned so far this semester to make sense of the data.
Start by making a bar plot showing the number of times each genre tag appears in the dataset. No need for fancy labels or titles, but do order the categories from smallest to largest and consider flipping the axes if your screen is too narrow to read the vertical-bar version of the plot.
m_genre %>%
  group_by(genre) %>%
  summarize(sm_count()) %>%
  arrange(desc(count)) %>%
  mutate(genre = fct_inorder(genre)) %>%
  ggplot(aes(genre, count)) +
    geom_col() +
    coord_flip()
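If the course helper `sm_count()` is not available, the counting step can be sketched with plain dplyr. This is only a minimal sketch: `toy_genre` is a small hypothetical stand-in for `m_genre`, and it assumes that `sm_count()` simply counts the rows in each group into a column named `count`, as the plotting code above suggests.

```r
library(dplyr)

# Hypothetical stand-in for m_genre: one row per (film, genre) pair.
toy_genre <- tibble(
  year  = 1970,
  title = c("Film A", "Film A", "Film B", "Film B", "Film C", "Film C"),
  genre = c("Drama", "Romance", "Action", "Drama", "Comedy", "Drama")
)

genre_counts <- toy_genre %>%
  group_by(genre) %>%
  summarize(count = n()) %>%   # assumed equivalent of sm_count()
  arrange(desc(count))         # most common genre tag first

genre_counts
```

The resulting table has one row per genre tag, which is the shape the bar plot needs.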
Now, we are going to do something a bit more complicated. In the chunk below, create a visualization that shows which genre tag is used the most in each year. The plot is easy; creating the dataset will take some work.
m_genre %>%
  group_by(year, genre) %>%
  summarize(sm_count()) %>%
  group_by(year) %>%
  arrange(desc(count)) %>%
  slice(1) %>%
  ggplot(aes(year, genre)) +
    geom_point(aes(size = count))
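The "count, then keep the top row per year" pattern can also be written with dplyr's `count()` and `slice_max()`. The sketch below runs on a hypothetical `toy_genre` table (invented titles and genres, not the real data) and is one possible way to build the dataset, not the only one.

```r
library(dplyr)

# Hypothetical stand-in for m_genre covering two years.
toy_genre <- tibble(
  year  = c(1970, 1970, 1970, 1971, 1971, 1971),
  title = c("Film A", "Film B", "Film B", "Film C", "Film C", "Film D"),
  genre = c("Drama", "Drama", "War", "Comedy", "Drama", "Comedy")
)

top_genre <- toy_genre %>%
  count(year, genre, name = "count") %>%           # rows per (year, genre)
  group_by(year) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%   # keep one top genre per year
  ungroup()

top_genre
```

Setting `with_ties = FALSE` keeps exactly one row per year, which mirrors the `slice(1)` step; with ties allowed, a year could contribute several genres.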
Finally, we are going to create a dataset that has one row for each film that can be merged into the main movies dataset. First, create a dataset that collapses all of the genres for a film into a single value using the function sm_paste.
m_genre %>%
  group_by(year, title) %>%
  summarize(sm_paste(genre))
## # A tibble: 4,843 x 3
## # Groups: year [50]
## year title genre_paste
## <dbl> <chr> <chr>
## 1 1970 ...tick... tick... tick... Drama; Action
## 2 1970 A Man Called Horse Adventure; Drama; Western
## 3 1970 A Voyage to Arcturus Fantasy
## 4 1970 Airport Action; Drama; Thriller
## 5 1970 Alex in Wonderland Comedy; Drama
## 6 1970 Angel Unchained Action; Thriller; Drama
## 7 1970 Barquero Western
## 8 1970 Beneath the Planet of the Apes Action; Adventure; Sci-Fi
## 9 1970 Beyond the Valley of the Dolls Comedy; Drama; Music
## 10 1970 Billy Boy Drama; Romance
## # … with 4,833 more rows
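For comparison, the collapsing step can be sketched in plain dplyr with `paste(..., collapse = )`. This assumes that `sm_paste()` joins the values with `"; "` and names the result `genre_paste`, which is what the output above shows; `toy_genre` is a hypothetical stand-in for `m_genre`.

```r
library(dplyr)

# Hypothetical stand-in for m_genre; the MASH tags follow the text.
toy_genre <- tibble(
  year  = 1970,
  title = c("Film A", "Film A", "MASH", "MASH", "MASH"),
  genre = c("Drama", "Romance", "Comedy", "Drama", "War")
)

genre_by_film <- toy_genre %>%
  group_by(year, title) %>%
  summarize(
    genre_paste = paste(genre, collapse = "; "),  # assumed behavior of sm_paste()
    .groups = "drop"                              # return an ungrouped table
  )

genre_by_film
```

Using `.groups = "drop"` avoids the leftover `year` grouping visible in the output above.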
And then, create a dataset that associates each film with the “least popular” genre assigned to it. For example, MASH is a comedy, a drama, and a war film. You should have seen above that the “war” tag is much less common than “comedy” or “drama”, so MASH should be associated with “war”.
m_genre %>%
  group_by(genre) %>%
  mutate(sm_count()) %>%
  arrange(count) %>%
  group_by(year, title) %>%
  slice(1) %>%
  ungroup() %>%   # not needed here, but good practice to ungroup when done
  select(-count)  # also probably do not need `count` anymore, so remove it
## # A tibble: 4,843 x 3
## year title genre
## <dbl> <chr> <chr>
## 1 1970 ...tick... tick... tick... Action
## 2 1970 A Man Called Horse Western
## 3 1970 A Voyage to Arcturus Fantasy
## 4 1970 Airport Thriller
## 5 1970 Alex in Wonderland Comedy
## 6 1970 Angel Unchained Thriller
## 7 1970 Barquero Western
## 8 1970 Beneath the Planet of the Apes Sci-Fi
## 9 1970 Beyond the Valley of the Dolls Music
## 10 1970 Billy Boy Romance
## # … with 4,833 more rows
Assigning the least-popular genre typically gives the most appropriate tag to each movie because it will usually also be the most specific genre.
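The sketch below shows one way the least-popular-genre table could be built and then merged into a movies table with `left_join()`. Everything here is illustrative: `toy_movies` and `toy_genre` are hypothetical stand-ins with invented films (only the MASH tags follow the text), and `add_count()` plus `slice_min()` are swapped in for the course helper.

```r
library(dplyr)

# Hypothetical stand-ins for movies and m_genre.
toy_movies <- tibble(
  year  = 1970,
  title = c("Film A", "Film B", "MASH", "Film C")
)
toy_genre <- tibble(
  year  = 1970,
  title = c("Film A", "Film A", "Film B", "Film B",
            "MASH", "MASH", "MASH", "Film C"),
  genre = c("Drama", "Romance", "Action", "Drama",
            "Comedy", "Drama", "War", "Comedy")
)

least_common <- toy_genre %>%
  add_count(genre, name = "count") %>%             # overall frequency of each tag
  group_by(year, title) %>%
  slice_min(count, n = 1, with_ties = FALSE) %>%   # rarest tag per film
  ungroup() %>%
  select(-count)

# One row per film, so the join does not duplicate any movie.
toy_movies_genre <- toy_movies %>%
  left_join(least_common, by = c("year", "title"))

toy_movies_genre
```

Because `least_common` has exactly one row per `(year, title)`, the joined table keeps the same number of rows as `toy_movies`.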
Start by verifying that the percentage values in m_color add up to 100 for each film (there may be some slight rounding error, but everything should be very close to 100). Use whatever method you find to be the easiest or most reliable, but do not resort to manually checking the values for each film.
m_color %>%
  group_by(year, title) %>%
  summarize(sm_sum(percentage)) %>%
  ungroup() %>%
  summarize(sm_min(percentage_sum), sm_max(percentage_sum))
## # A tibble: 1 x 2
## percentage_sum_min percentage_sum_max
## <dbl> <dbl>
## 1 100. 100.
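Another way to run this check is to list any films whose total falls outside a small tolerance, so that an empty result means every poster adds up. The sketch below uses a hypothetical `toy_color` table and a hypothetical tolerance of 0.01; neither comes from the real data.

```r
library(dplyr)

# Hypothetical stand-in for m_color with two posters.
toy_color <- tibble(
  year       = 1970,
  title      = rep(c("Film A", "Film B"), each = 3),
  color      = rep(c("black", "white", "red"), times = 2),
  percentage = c(11.0, 75.8, 13.2, 33.3, 33.3, 33.4)
)

sums <- toy_color %>%
  group_by(year, title) %>%
  summarize(percentage_sum = sum(percentage), .groups = "drop")

tolerance <- 0.01   # hypothetical allowance for rounding error

# Films whose percentages do not add up to (roughly) 100.
off_films <- sums %>%
  filter(abs(percentage_sum - 100) > tolerance)

off_films
```

Comparing against a tolerance rather than testing `percentage_sum == 100` avoids false alarms from floating-point rounding.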
Next, associate each film with the color that is most dominant in its poster. Plot a count of the most dominant colors using a bar plot. No need for any labels, titles, or other finishing touches, but do try to color the bars according to the associated color name. Note that you will have to use the aesthetic “fill” and the scale scale_fill_identity to do this.
m_color %>%
  group_by(year, title) %>%
  arrange(desc(percentage)) %>%
  slice(1) %>%
  group_by(color) %>%
  summarize(sm_count()) %>%
  ggplot(aes(color, count)) +
    geom_col(aes(fill = color), color = "black") +
    scale_fill_identity()
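Since `scale_fill_identity()` uses the values of `color` directly as fill colors, each value has to be something R recognizes as a color. A quick way to check this before plotting is to compare the values against the built-in names returned by `colors()`. The sketch below does so on a hypothetical `toy_color` table; it is a sanity check under that assumption, not part of the exercise itself.

```r
library(dplyr)

# Hypothetical stand-in for m_color with two posters.
toy_color <- tibble(
  year       = 1970,
  title      = rep(c("Film A", "Film B"), each = 3),
  color      = c("black", "white", "red", "blue", "other", "grey"),
  percentage = c(11.0, 75.8, 13.2, 60.0, 30.0, 10.0)
)

dominant <- toy_color %>%
  group_by(year, title) %>%
  slice_max(percentage, n = 1, with_ties = FALSE) %>%  # most dominant color
  ungroup() %>%
  count(color, name = "count")

# Any dominant color that is not a built-in R color name.
invalid <- setdiff(dominant$color, colors())
invalid
```

In this toy table the label "other" is not a built-in color name, but it is never the dominant color, so `invalid` comes back empty and the plot would draw without complaint.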
Redo the same below, but now include the two most dominant colors. Note any patterns that you see in relation to the first plot. Note: this should be an easy tweak of your last plot.
m_color %>%
  group_by(year, title) %>%
  arrange(desc(percentage)) %>%
  slice(1:2) %>%
  group_by(color) %>%
  summarize(sm_count()) %>%
  ggplot(aes(color, count)) +
    geom_col(aes(fill = color), color = "black") +
    scale_fill_identity()
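The same tweak can be expressed with `slice_max()` by changing `n = 1` to `n = 2`. The sketch below uses a hypothetical `toy_color` table to show that each poster then contributes two rows to the counts.

```r
library(dplyr)

# Hypothetical stand-in for m_color with two posters.
toy_color <- tibble(
  year       = 1970,
  title      = rep(c("Film A", "Film B"), each = 3),
  color      = c("black", "white", "red", "black", "white", "blue"),
  percentage = c(11.0, 75.8, 13.2, 50.0, 40.0, 10.0)
)

top_two <- toy_color %>%
  group_by(year, title) %>%
  slice_max(percentage, n = 2, with_ties = FALSE) %>%  # two most dominant colors
  ungroup() %>%
  count(color, name = "count")

top_two
```

The counts now sum to twice the number of posters, so bar heights in this version are not directly comparable to those in the single-color plot.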