# Class 13: Working with Categorical Data

### Learning Objectives

Today we will explore how to create new variables from old variables and, specifically, how to change the way that categorical variables are presented in plots.

### Tea Data

For class today, let’s load a dataset of teas offered by the website Adagio:
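To experiment without the real file, a small stand-in can be built by hand. This is only a sketch: the column names `name`, `type`, `score`, and `price` (in cents) are assumptions, and every value is invented rather than taken from the Adagio data.

```r
# Stand-in for the tea dataset; all names and values here are invented.
tea <- data.frame(
  name  = c("tea_a", "tea_b", "tea_c", "tea_d", "tea_e"),
  type  = c("green", "black", "green", "oolong", "herbal"),
  score = c(92, 88, 95, 90, 85),
  price = c(900, 1200, 2400, 1000, 700)  # prices in cents
)

# Check the column names and types.
str(tea)
```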

### New Variables

In R, we can create new variables from old ones by applying numeric operations or other functions. In plots, simple manipulations can be done in-line; that is, we apply the functions to the variables within the plot command. For example, the tea dataset gives prices in cents. We can make a plot of price in dollars against the score as follows:
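A minimal sketch of such an in-line conversion, assuming columns named `score` and `price` (in cents) and using invented values in place of the real data:

```r
library(ggplot2)

# Invented stand-in values; the real tea dataset is not reproduced here.
tea <- data.frame(
  score = c(92, 88, 95, 90, 85),
  price = c(900, 1200, 2400, 1000, 700)  # cents
)

# Divide by 100 inside aes() to plot dollars without changing the dataset.
p <- ggplot(tea, aes(score, price / 100)) +
  geom_point()
print(p)
```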

Notice that the expression shows up verbatim in the plot. We can apply other functions such as sqrt or combine two variables similarly (note: this makes no practical sense here):
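Both ideas might look like the following sketch. The columns and values are invented, and the ratio `price / score` is chosen only to show the syntax.

```r
library(ggplot2)

# Invented stand-in values; column names are assumptions.
tea <- data.frame(
  score = c(92, 88, 95, 90, 85),
  price = c(900, 1200, 2400, 1000, 700)  # cents
)

# Apply a function to a variable inside aes().
p_sqrt <- ggplot(tea, aes(score, sqrt(price))) +
  geom_point()

# Combine two variables inside aes(); not a meaningful quantity here.
p_combo <- ggplot(tea, aes(score, price / score)) +
  geom_point()

print(p_sqrt)
print(p_combo)
```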

If a derived quantity is particularly useful or complex to construct, it may be worth storing it as a new variable in the dataset. The syntax to do this is as follows:
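A sketch of the assignment, assuming a `price` column in cents; the new column name `price_dollars` is a hypothetical choice and the values are invented:

```r
# Invented stand-in values; column names are assumptions.
tea <- data.frame(
  score = c(92, 88, 95),
  price = c(900, 1200, 2400)  # cents
)

# Store the converted price as a new column of the same data frame.
tea$price_dollars <- tea$price / 100

head(tea)
```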

Notice that we need to start every variable name with tea\$; otherwise R will not know which dataset we are working with. In ggplot2 commands this is not a problem because we have already stated what the default dataset is.

### Making Numeric Data Discrete

Often in plots it will be useful to convert numeric data into categorical data. There are three functions that I typically use to do this, depending on the end-goal:

• factor: this converts each unique value of the input into its own category
• cut: breaks the range of the numeric variable into equal-width intervals and groups together the numbers that fall in the same interval
• bin: breaks the numeric data into bins that each hold roughly the same number of observations

The latter two require an option named n that specifies the number of buckets.

Let’s take a look at how this works for factor:
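A small sketch on invented scores (the column name `score` is an assumption), showing that each distinct value becomes its own level:

```r
# Invented stand-in values; repeated scores collapse into one category each.
tea <- data.frame(score = c(92, 88, 95, 88, 92, 94))

# factor() makes one category per unique value.
score_cat <- factor(tea$score)

levels(score_cat)  # the distinct scores, as category labels
table(score_cat)   # how many rows fall in each category
```

Inside a ggplot2 call the same conversion can be written in-line, for example `aes(color = factor(score))`.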

Cut with 5 bins:
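The `cut` used in class may be a course helper that accepts `n`; base R's `cut()` takes the number of intervals through its `breaks` argument instead. A sketch with base R on invented prices:

```r
# Invented stand-in prices in cents.
tea <- data.frame(
  price = c(700, 900, 1000, 1200, 1500, 1800, 2400, 2600, 3000, 3500)
)

# breaks = 5 splits the range of price into five equal-width intervals.
price_cut <- cut(tea$price, breaks = 5)

# Equal widths do not imply equal counts; some intervals may be empty.
table(price_cut)
```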

And bin with 5 bins:
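Base R has no `bin` function, so the one used in class presumably comes from a course package. Assuming it splits the data so that each bucket holds about the same number of rows, ggplot2's `cut_number()` behaves similarly and serves as a stand-in in this sketch (prices are invented):

```r
library(ggplot2)

# Invented stand-in prices in cents.
tea <- data.frame(
  price = c(700, 900, 1000, 1200, 1500, 1800, 2400, 2600, 3000, 3500)
)

# cut_number() chooses break points so each bin gets about the same count.
price_bin <- cut_number(tea$price, n = 5)

table(price_bin)
```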

You may find these useful, for one thing, when making maps in your second project.

### Changing Categorical Variables

The package forcats provides a number of functions for changing the way that categories are displayed. I find that these four are the most useful:

• fct_inorder: order the categories by where they first appear in the data
• fct_infreq: order the categories from the most frequent to the least frequent
• fct_rev: reverse the order of the categories (useful to apply after fct_infreq)
• fct_lump: lump together the smallest categories into a single Other category. Set the option n to specify the number of categories to keep

We can see the effect of these most clearly on a bar plot, such as:
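A sketch with `fct_infreq`, assuming a categorical column named `type`; the categories and counts are invented:

```r
library(ggplot2)
library(forcats)

# Invented stand-in data: 4 green, 3 black, 2 oolong, 1 herbal.
tea <- data.frame(
  type = c("green", "black", "green", "oolong", "green",
           "black", "herbal", "oolong", "black", "green")
)

# Without forcats the bars follow alphabetical order.
p_default <- ggplot(tea, aes(type)) +
  geom_bar()

# fct_infreq() puts the most frequent category first.
p_freq <- ggplot(tea, aes(fct_infreq(type))) +
  geom_bar()

print(p_freq)
```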

Or
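A second variant on the same kind of invented data wraps the frequency ordering in `fct_rev`, so the bars run from the least to the most frequent category. The column name `type` is again an assumption.

```r
library(ggplot2)
library(forcats)

# Invented stand-in data: 4 green, 3 black, 2 oolong, 1 herbal.
tea <- data.frame(
  type = c("green", "black", "green", "oolong", "green",
           "black", "herbal", "oolong", "black", "green")
)

# fct_rev() flips whatever order fct_infreq() produced.
p_rev <- ggplot(tea, aes(fct_rev(fct_infreq(type)))) +
  geom_bar()

print(p_rev)
```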

They are very useful for when you want to use color but have too many small categories:
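A sketch of lumping before mapping to color, with invented data and assumed column names `type`, `score`, and `price`; `type_lumped` is a hypothetical name for the new column:

```r
library(ggplot2)
library(forcats)

# Invented stand-in data: 4 green, 3 black, 2 oolong, 1 herbal.
tea <- data.frame(
  type  = c("green", "black", "green", "oolong", "green",
            "black", "herbal", "oolong", "black", "green"),
  score = c(92, 88, 95, 90, 85, 91, 80, 93, 87, 94),
  price = c(900, 1200, 2400, 1000, 700, 1500, 1800, 2600, 3000, 3500)
)

# Keep the two most common types; the rest are merged into "Other".
tea$type_lumped <- fct_lump(tea$type, n = 2)

# Fewer categories means fewer colors to tell apart in the legend.
p <- ggplot(tea, aes(score, price / 100)) +
  geom_point(aes(color = type_lumped))

print(p)
```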

### Practice

We will, once again, work on a lab for the remainder of the class: lab13.Rmd. Upload your script to GitHub ahead of the next class.