Class 06: Bikes (and Matrices!)

2017-09-14

library(readr)
library(ggplot2)
library(dplyr)

Linear Models with Matrices in R

Today, we will look at a dataset of bike sharing usage in Washington, DC. Our task is to predict how many bikes are rented each day:

bikes <- read_csv("https://statsmaths.github.io/ml_data/bikes.csv")

We are going to fit linear models to the dataset, but I want to modify how we build the models. Specifically, I want to manually construct the data matrix rather than having it made behind the scenes by R.

Matrices

Linear Maps

Consider all functions f from an n-dimensional space into an m-dimensional space,

$$ f: \mathbb{R}^n \rightarrow \mathbb{R}^m, $$

that preserve addition,

$$ f(u + v) = f(u) + f(v), $$

and multiplication by a fixed scalar value,

$$ f(\alpha \cdot u) = \alpha \cdot f(u), \qquad \alpha \in \mathbb{R}. $$

Such functions are known as linear maps (the definition can be abstracted to infinite-dimensional spaces and to other fields, but we will only need the finite real-valued case in this course).

There is an important representation theorem stating that every such mapping can be described by an m-by-n array of real numbers. This array is commonly called a matrix. How exactly do we calculate the function f given its matrix representation A? It is easier to explain with fixed values for n and m. Let's consider the following matrix:

$$ A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \end{pmatrix} $$

So, in this case we have n = 3 and m = 2. Therefore this is a mapping that takes a triple of numbers and returns a pair of numbers. Let's define the input as a vector u:

$$ u = \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} $$

And the output as a vector v:

$$ v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} $$

The linear map is then defined as:

$$ \begin{aligned} v_1 &= a_{1,1} \cdot u_1 + a_{1,2} \cdot u_2 + a_{1,3} \cdot u_3 \\ v_2 &= a_{2,1} \cdot u_1 + a_{2,2} \cdot u_2 + a_{2,3} \cdot u_3 \end{aligned} $$

So, each component of v is a linear combination of all components of u. We can write this compactly using summation notation:

$$ v_j = \sum_{i=1}^{n} a_{j,i} \cdot u_i, \qquad j = 1, \ldots, m $$

Conveniently, this last equation holds for any arbitrary choice of n and m. Finally, we represent this symbolically by

$$ v = Au $$
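For concreteness, here is a small worked example, with values chosen arbitrarily for illustration: a 2-by-3 matrix applied to a vector in 3-dimensional space.

$$ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, \qquad u = \begin{pmatrix} 1 \\ 0 \\ 2 \end{pmatrix}, \qquad v = Au = \begin{pmatrix} 1 \cdot 1 + 2 \cdot 0 + 3 \cdot 2 \\ 4 \cdot 1 + 5 \cdot 0 + 6 \cdot 2 \end{pmatrix} = \begin{pmatrix} 7 \\ 16 \end{pmatrix}. $$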

Function composition

Consider two linear maps:

$$ f: \mathbb{R}^n \rightarrow \mathbb{R}^m, \qquad g: \mathbb{R}^m \rightarrow \mathbb{R}^p. $$

If we apply f to a vector in n-dimensional space we get an m-dimensional vector. We could then take the output of this map and apply g to it in order to get a p-dimensional vector. This is known as function composition. We can represent the action of first applying f and then applying g as a new function h:

$$ h(u) = g(f(u)) $$

It is a fairly intuitive and easy-to-prove result that if f and g are linear maps, then so is h. Let f be represented by the matrix A, g by the matrix B, and h by the matrix C. A natural question is what relationship exists between A, B, and C?

It turns out that the result is just another sum:

$$ c_{k,i} = \sum_{j=1}^{m} b_{k,j} \cdot a_{j,i}, \qquad k = 1, \ldots, p, \quad i = 1, \ldots, n. $$

This is known as a matrix product and is written as:

$$ C = BA $$

If the concept of matrix multiplication is new to you, an animated visualization of multiplying two matrices can be useful. Of course, for this semester, understanding the abstract meaning behind matrix multiplication (function composition) is much more important than grasping the mechanics of computing the new matrix.

Notice that if we represent vectors as one-column matrices, the definition of matrix multiplication is equivalent to our definition of applying a matrix to a vector: if u is an n-by-1 matrix and A is an m-by-n matrix, the matrix product Au is exactly the m-by-1 matrix holding v = f(u).
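We can verify this, and the composition rule above, numerically in R using the %*% operator demonstrated in the next section. This is just a sketch, with arbitrary values:

# f is represented by A (2-by-3: R^3 -> R^2); g by B (2-by-2: R^2 -> R^2)
A <- matrix(1:6, nrow = 2)
B <- matrix(1:4, nrow = 2)
u <- matrix(c(1, 0, 2))        # a vector in R^3, stored as a one-column matrix

B %*% (A %*% u)                # apply f first, then g
(B %*% A) %*% u                # the composed map h applied directly; same result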

Matrices in R

Basics

There is extensive functionality for working with matrices in R. Let's make a random 5-by-5 matrix and a 5-by-1 matrix; the latter we can think of as a vector.

A <- matrix(sample(1:99, 25), 5, 5)
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   36   10   71   31   99
## [2,]   27   98   48   84   87
## [3,]   91   80   62   51   75
## [4,]   97    4   47   79   78
## [5,]   45   66   74   56   86
b <- matrix(sample(1:99, 5))
b
##      [,1]
## [1,]    9
## [2,]   38
## [3,]   55
## [4,]   18
## [5,]   78

Arithmetic is performed element-wise by default:

A + A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   72   20  142   62  198
## [2,]   54  196   96  168  174
## [3,]  182  160  124  102  150
## [4,]  194    8   94  158  156
## [5,]   90  132  148  112  172

To calculate a matrix product, R requires us to use the operator %*%. If a matrix is square, there is nothing stopping us from composing the matrix with itself:

A %*% A
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 15489 13678 16221 13570 20691
## [2,] 20049 19792 19983 23025 28833
## [3,] 19400 18864 22092 20932 31047
## [4,] 19050 10586 19478 16349 26346
## [5,] 19438 18738 19947 19953 27511

Similarly, we can multiply the matrix by the column vector to compute the action of the matrix A as a linear map:

A %*% b
##       [,1]
## [1,] 12889
## [2,] 14905
## [3,] 14037
## [4,] 11116
## [5,] 14699

The transpose of a matrix, denoted by using t as a superscript, can be computed with the t function:

t(A)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   36   27   91   97   45
## [2,]   10   98   80    4   66
## [3,]   71   48   62   47   74
## [4,]   31   84   51   79   56
## [5,]   99   87   75   78   86

This is often useful in matrix computations. Similarly, we can compute the inverse of a matrix with the function solve:

solve(A)
##               [,1]         [,2]        [,3]         [,4]         [,5]
## [1,]  0.0003860159 -0.003961447  0.01456756  0.003181661 -0.012026820
## [2,] -0.0014545334  0.006055068  0.01087417 -0.008429623 -0.006288891
## [3,] -0.0207098497 -0.026338806 -0.01832479  0.002371943  0.064315102
## [4,] -0.0184174969  0.001098155 -0.02054939  0.014111199  0.025213062
## [5,]  0.0307271768  0.019374483  0.01318098 -0.006425239 -0.049011341

A matrix inverse, by definition, describes the inverse of the underlying linear map (it's also relatively simple to show that linear maps have linear inverses, when they exist). Note that matrix inversion is a computationally unstable procedure and should be avoided when possible.
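One practical consequence: if we only need the action of the inverse on a particular vector, it is better to have solve solve the linear system directly rather than forming the full inverse first. A quick sketch using the objects defined above:

solve(A, b)          # solves the system A x = b directly
solve(A) %*% b       # the same answer, but computes the full inverse first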

Subsetting matrices

It will often be useful to take a subset of the rows and columns in a matrix. Here we take rows 2 and 3 and columns 1 and 2:

A[2:3, 1:2]
##      [,1] [,2]
## [1,]   27   98
## [2,]   91   80

Here we take columns 1 and 2; by leaving the rows component empty, every row is returned:

A[,1:2]
##      [,1] [,2]
## [1,]   36   10
## [2,]   27   98
## [3,]   91   80
## [4,]   97    4
## [5,]   45   66

There is a strange convention in R that, by default, if we select a sub-matrix with only one row or one column, the result is converted from a rectangular matrix into a plain vector with no dimension attribute. Notice that the output below is not given as a matrix with one column:

A[,1]
## [1] 36 27 91 97 45

Usually this is not a problem, but if we want to be safe we can add the option drop = FALSE to the subset command. Notice the difference in the output here compared to the output above:

A[,1,drop = FALSE]
##      [,1]
## [1,]   36
## [2,]   27
## [3,]   91
## [4,]   97
## [5,]   45

Finally, we can also subset by giving a logical statement in either the rows or columns spot. Here we select only those rows where the numbers 1 through 5 are greater than 3:

A[1:5 > 3,]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   97    4   47   79   78
## [2,]   45   66   74   56   86

As expected, the output is a matrix with just two rows.

Multivariate Linear Models

Matrix Formulation

We have been working with multivariate linear models over the past few classes, though I have only ever written the formal equation in the case where there are two explanatory variables. In general, multivariate regression represents the following model:

$$ y_i = x_{1,i} \beta_1 + x_{2,i} \beta_2 + \cdots + x_{p,i} \beta_p + \epsilon_i, \qquad i = 1, \ldots, n. $$

For simplicity, we won’t include an explicit intercept term in the model. If we want one, we will just make the first variable $x_{1,i}$ equal to one for every value of i.

The statistical estimation problem is to estimate the p components of the vector beta.
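The standard way to do this, and the approach used by the lm and lm.fit functions below, is ordinary least squares: choose the beta that minimizes the sum of squared residuals,

$$ \widehat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{j,i} \beta_j \right)^2. $$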

Using our matrix notation, we can write the linear model simultaneously for all observations:

$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{p,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{p,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,n} & x_{2,n} & \cdots & x_{p,n} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} $$

This can be compactly written as:

$$ y = X \beta + \epsilon. $$

The matrix X is known as the design matrix or model matrix. For reference, the objects in this equation have the following dimensions:

$$ y \in \mathbb{R}^n, \qquad X \in \mathbb{R}^{n \times p}, \qquad \beta \in \mathbb{R}^p, \qquad \epsilon \in \mathbb{R}^n. $$

Fitting Linear Models with Matrices

Note that the bikes data, like every other dataset we have used, is not a matrix. It is something that R calls a data frame:

class(bikes)
## [1] "tbl_df"     "tbl"        "data.frame"

While both matrices and data frames have data organized in rows and columns, matrices force all of the data to be numeric (it's actually more complicated than this in R, but just go along with it for now). A data frame, on the other hand, can have a different variable type in each column.
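One consequence worth seeing directly: if we convert a data frame containing a character column into a matrix, every entry gets coerced to a character string. A quick illustration with made-up values (not from the bikes dataset):

df <- data.frame(temp = c(6.7, 8.0), season = c("Winter", "Spring"))
as.matrix(df)        # every entry, including temp, is now a character string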

An easy way to construct a matrix in R is to use the select function to grab only numeric columns and then the as.matrix function to convert the output into a matrix object. Throughout the course we will use the convention that the variable y is a vector holding the response of interest and X is a matrix containing the variables we want to use in a model.

y <- bikes$count

X <- as.matrix(select(bikes, temp, humidity))
X[1:10,]
##            temp humidity
##  [1,]  6.715022 0.805833
##  [2,]  7.989548 0.696087
##  [3,] -3.039976 0.437273
##  [4,] -2.800000 0.590435
##  [5,] -1.020838 0.436957
##  [6,] -2.513032 0.518261
##  [7,] -3.029548 0.498696
##  [8,] -5.110000 0.535833
##  [9,] -6.870022 0.434167
## [10,] -6.045022 0.482917

We can then create specific training and validation sets using the logical subset method from above:

X_train <- X[bikes$train_id == "train", ]
X_valid <- X[bikes$train_id == "valid", ]
y_train <- y[bikes$train_id == "train"]
y_valid <- y[bikes$train_id == "valid"]

So this generally looks good, but we left out the intercept. Recall that our matrix formulation required us to add an explicit column of 1’s if we wanted an intercept. Let’s do this directly in the matrix X using the cbind function, and repeat:

X <- as.matrix(select(bikes, temp, humidity))
X <- cbind(1, X)
X_train <- X[bikes$train_id == "train", ]
X_valid <- X[bikes$train_id == "valid", ]
head(X_train)
##             temp humidity
## [1,] 1  6.715022 0.805833
## [2,] 1  7.989548 0.696087
## [3,] 1 -3.039976 0.437273
## [4,] 1 -1.020838 0.436957
## [5,] 1 -6.045022 0.482917
## [6,] 1 -4.839994 0.686364

Using lm.fit

We have seen how to use the lm function to quickly fit linear regressions directly from data frames. Now, how do we actually fit a linear model once we have these matrices? Next class we will see how to do this directly with matrix operations. For now, there is an intermediate function called lm.fit that solves the linear regression problem directly from our matrices. In fact, the lm function internally calls lm.fit. As inputs, it takes just the matrix X and the response y. It returns a lot of diagnostic output, but we will take just the coef component, corresponding to the estimated coefficients:

beta <- lm.fit(X_train, y_train)$coef
beta
##                  temp   humidity 
##  4462.0106   107.5505 -2702.7323

We can create predicted values for the whole dataset by matrix multiplying X with beta:

bikes$count_pred <- X %*% beta

Let’s verify that this gives the same output as the lm function:

model <- lm(count ~ temp + humidity, data = bikes,
            subset = train_id == "train")
model
## 
## Call:
## lm(formula = count ~ temp + humidity, data = bikes, subset = train_id == 
##     "train")
## 
## Coefficients:
## (Intercept)         temp     humidity  
##      4462.0        107.6      -2702.7

And, as hoped, it does!
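Since we have the training and validation matrices as well as the fitted coefficients, we can also evaluate the model, for example with the root mean squared error on each set. A small sketch using the objects defined above:

sqrt(mean((y_train - X_train %*% beta)^2))   # RMSE on the training set
sqrt(mean((y_valid - X_valid %*% beta)^2))   # RMSE on the validation set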

Another (The Best!) Way to Make Model Matrices

When we simply want to set the model matrix X to a subset of the numeric columns in our data frame, the function as.matrix is usually sufficient. However, the formula interface to lm is incredibly useful when working with categorical variables or when processing numeric variables with special functions such as poly.

The function model.matrix allows us to compute the model matrix from the formula interface. In fact, the lm function calls model.matrix to convert our inputs into a model matrix, and the output of this is then passed to lm.fit. It will also, by default, include an intercept term for us. Here we use it to build the same model matrix as before:

X <- model.matrix(~ temp + humidity, data = bikes)
head(X)
##   (Intercept)      temp humidity
## 1           1  6.715022 0.805833
## 2           1  7.989548 0.696087
## 3           1 -3.039976 0.437273
## 4           1 -2.800000 0.590435
## 5           1 -1.020838 0.436957
## 6           1 -2.513032 0.518261

Notice that the intercept has been added for us. What is nice about this formulation is that we can use commands like poly and factor (or just raw character vectors) and have R take care of all the hard work for us:

X <- model.matrix(~ poly(humidity, degree = 2) + factor(season),
                  data = bikes)
head(X)
##   (Intercept) poly(humidity, degree = 2)1 poly(humidity, degree = 2)2
## 1           1                 0.046239257                  0.01733700
## 2           1                 0.017720634                 -0.01966006
## 3           1                -0.049534837                  0.01868123
## 4           1                -0.009734122                 -0.02531409
## 5           1                -0.049616953                  0.01883585
## 6           1                -0.028489275                 -0.01227834
##   factor(season)Spring factor(season)Summer factor(season)Winter
## 1                    0                    0                    1
## 2                    0                    0                    1
## 3                    0                    0                    1
## 4                    0                    0                    1
## 5                    0                    0                    1
## 6                    0                    0                    1

We will begin next class by showing some best practices for how to build model matrices. At that point we will be able to apply more models that require us to put data into a matrix format.

Indicator variables

Notice that each of the season columns is an indicator for which season of the year each day falls in. Let's look at the first 6 values of the season variable to verify this:

head(bikes$season)
## [1] "Winter" "Winter" "Winter" "Winter" "Winter" "Winter"

If we look at a table of all the seasons, we see that there are four values, yet one of them received no indicator column in the model matrix:

table(bikes$season)
## 
##   Fall Spring Summer Winter 
##    178    184    188    181

The missing value "Fall" is known as the baseline in this model. The intercept indicates the expected number of rentals for a day in the Fall. Each of the other terms gives how many more or fewer rentals a season is expected to have relative to the baseline "Fall".

Changing the baseline will change all of the beta coefficients. However, the predicted values will remain the same. In 209 and 289, I spend a lot of time talking about changing the baseline and understanding the model from different perspectives. As we are focused on prediction, which is unchanged, this will be much less of a concern for us. Just note that by default variables are sorted in alphabetical order, which is why “Fall” is the baseline here.