I wanted to share a brief note about the connection between linear regression and principal components. In a linear regression, our goal is to find a linear combination of the columns of our constructed data matrix X. Mathematically, we have:

\[ z_i = x_{i, 1} \cdot \beta_1 + x_{i, 2} \cdot \beta_2 + \cdots + x_{i, p} \cdot \beta_p = \sum_j x_{i, j} \cdot \beta_j = x_i^t \beta \]

The goal is to find the combination for which the predictions z are as close as possible to a supervising variable y that we are trying to predict. More specifically, we want to find the beta vector that minimizes the sum of squared residuals:

\[ \sum_i \left( z_i - y_i \right)^2 \]
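
To make this concrete, here is a minimal sketch in NumPy; the data matrix X and the response y below are made up purely for illustration, and the minimizing beta is found with ordinary least squares:

```python
# A minimal sketch, assuming NumPy; X and y are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # data matrix with p = 3 columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Ordinary least squares: the beta that minimizes sum_i (z_i - y_i)^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
z = X @ beta                                         # predictions z_i = x_i^t beta
print(beta)
print(np.sum((z - y) ** 2))                          # residual sum of squares
```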

In PCA, each principal component is also defined as a linear combination of the columns of the data matrix X. To be explicit, we have:

\[ z_i = x_{i, 1} \cdot \beta_1 + x_{i, 2} \cdot \beta_2 + \cdots + x_{i, p} \cdot \beta_p = \sum_j x_{i, j} \cdot \beta_j = x_i^t \beta \]

However, we have no supervising variable to determine what a good set of z's is. Instead, our goal is to maximize the variation of the z's (yes, maximize; that's not a typo). So, we want to maximize the quantity:

\[ \sum_i \left( z_i - \bar{z} \right)^2 \]

Of course, we could make this quantity arbitrarily large just by scaling up the beta vector, and with it the scale of z. To avoid this, and to force PCA to produce meaningful quantities, we also restrict the beta vector to have a Euclidean norm of one.
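
Here is a rough sketch of the first component in NumPy, again with a made-up X. One standard way to compute the loading vector is from the SVD of the centered data matrix; centering X up front plays the same role as subtracting the mean of z in the sum above:

```python
# A minimal sketch, assuming NumPy; X is made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)                              # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

beta1 = Vt[0]                                        # first loading vector
print(np.linalg.norm(beta1))                         # Euclidean norm is 1, as required
z = Xc @ beta1                                       # scores z_i = x_i^t beta
print(np.sum((z - z.mean()) ** 2))                   # maximized sum of squares (= s[0] ** 2)
```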

The other principal components are defined the same way, but with the additional requirement that the next beta vector must be perpendicular to all previous beta vectors.
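
Continuing in the same spirit, a quick sketch (with the same made-up X) showing that the loading vectors returned by the SVD are indeed mutually perpendicular, each with unit norm:

```python
# A minimal sketch, same made-up X as above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.round(Vt @ Vt.T, 10))                       # identity matrix: the loading
                                                     # vectors are orthonormal
```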