# Momentum
## Generate data
To illustrate the use of momentum in optimization algorithms, let us use it to
compute the ordinary least squares solution in linear regression. I will first
generate a small synthetic data set:
```{r}
set.seed(42)
n <- 1000                           # number of observations
p <- 10                             # number of predictors
X <- matrix(rnorm(n * p), ncol=p)
beta <- 1 / 1:p                     # true coefficients, decaying as 1/j
y <- X %*% beta + rnorm(n, sd=0.5)  # response with Gaussian noise
```
The true solution, which in this case we can calculate analytically,
comes from solving the normal equations. We can get it in R with the following:
```{r}
betaOls <- coef(lm(y ~ X - 1))
```
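The normal equations $X^\top X \beta = X^\top y$ can also be solved directly; as a
quick check (a minimal sketch using base R's `solve` and `crossprod`), the result
agrees with the `lm` fit:
```{r}
betaNormal <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) beta = X'y
max(abs(betaNormal - betaOls))                      # should be essentially zero
```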
## Standard gradient descent
Standard gradient descent calculates the full gradient of the loss and moves some
amount in the opposite direction of the gradient. Below we run this algorithm for
100 iterations and plot the error on the log scale.
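Concretely, for the least squares loss $\frac{1}{n} \| y - Xb \|^2$ with step size
$\eta$, the update is
$$
b \leftarrow b - \eta \cdot \frac{2}{n} X^\top (Xb - y),
$$
which matches the `gradFull` line in the code: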
```{r}
eta <- 0.01
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)  # full-batch gradient
  v <- -eta * gradFull                                    # step in the negative gradient direction
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
```
Notice that on the log scale the error decreases almost exactly linearly in the
iteration number. Even so, the error is still relatively high after 100 iterations.
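A straight line on the log scale corresponds to geometric convergence: the error
shrinks by a roughly constant factor per iteration. We can estimate that factor from
the recorded errors (a quick check, reusing the `err` vector from the chunk above):
```{r}
exp(mean(diff(log(err))))  # average per-iteration contraction factor
```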
## Newton's method
Newton's method uses a quadratic approximation to the function to estimate
not just which direction to move in but also how far to move in that direction.
It is often augmented with a learning rate, though this rate plays a very different
role than the one in gradient descent: under fairly weak assumptions Newton's method
converges even with the learning rate set to 1.
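For our least squares loss the Hessian is the constant matrix $\frac{2}{n} X^\top X$,
and with learning rate $\gamma$ the update is
$$
b \leftarrow b - \gamma \left[\nabla^2 f(b)\right]^{-1} \nabla f(b).
$$
The chunk below runs 100 iterations with $\gamma = 0.5$: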
```{r}
gamma <- 0.5
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)  # gradient of the loss
  hessFull <- (2/n) * t(X) %*% X                         # Hessian (constant for a quadratic loss)
  v <- -gamma * solve(hessFull, gradFull)                # Newton step, scaled by the learning rate
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
```
Because the linear regression loss is itself a quadratic, a learning rate of 1 would
solve the problem exactly in a single step. We see that even setting it to 0.5
converges to machine (double) precision within about 50 steps.
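As a quick check (a minimal sketch reusing the objects defined above), a single
Newton step from $b = 0$ with $\gamma = 1$ recovers the least squares solution up to
floating point error:
```{r}
g0 <- -(2/n) * t(X) %*% y           # gradient at b = 0
H0 <- (2/n) * t(X) %*% X            # constant Hessian
max(abs(-solve(H0, g0) - betaOls))  # should be essentially zero
```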
## Basic momentum
What if, instead of computing the Hessian, we approximate that second-order
information with basic momentum, using a momentum parameter $\mu$ of 0.8?
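Classical momentum keeps a running velocity $v$ that accumulates past gradients:
$$
v \leftarrow \mu v - \eta \nabla f(b), \qquad b \leftarrow b + v.
$$
This is exactly the update in the chunk below: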
```{r}
eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  v <- v * mu - eta * gradFull  # accumulate velocity, then step
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
```
Though slightly noisier at some points along the curve, momentum gives a drastically
better convergence rate than plain gradient descent!
## Nesterov's accelerated gradient (Nesterov momentum)
Finally, we'll use Nesterov's trick: evaluate the gradient at the look-ahead point
$b + \mu v$ rather than at the current iterate.
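The update becomes
$$
v \leftarrow \mu v - \eta \nabla f(b + \mu v), \qquad b \leftarrow b + v,
$$
which is what the `bhalf` variable implements in the chunk below: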
```{r}
eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  bhalf <- b + v * mu  # look-ahead point
  gradFull <- (2/n) * ((t(X) %*% X) %*% bhalf - t(X) %*% y)
  v <- v * mu - eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
```
The benefit here is slight, but still noticeable. For noisier functions, where
the gradient is changing rapidly, the differences will become more pronounced.
## Example of SGD
Finally, let's put this all together and use Nesterov momentum within stochastic
gradient descent (SGD), where each update uses the gradient computed on a small
mini-batch of observations rather than on the full data set.
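The chunk below runs $E = 10$ epochs with mini-batches of size $m = 5$, decaying the
learning rate by a factor of $1/k$ in epoch $k$:
$$
v \leftarrow \mu v - \frac{\eta}{k}\, \nabla_{\mathcal{B}} f(b + \mu v), \qquad b \leftarrow b + v,
$$
where $\nabla_{\mathcal{B}} f$ is the least squares gradient evaluated only on the
current mini-batch $\mathcal{B}$.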
```{r}
eta <- 0.01
b <- rep(0, p)
m <- 5                       # mini-batch size
E <- 10                      # number of epochs
mu <- 0.8
v <- 0
numBatches <- floor(n / m)   # mini-batches per epoch
err <- rep(NA, numBatches * E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n), ncol=m)  # reshuffle the data each epoch
  for (j in 1:numBatches) {
    ind <- miniBatchIndex[j,]
    bhalf <- b + v * mu  # Nesterov look-ahead point
    gradMini <- (1/m) * (2 * (t(X[ind,]) %*% X[ind,]) %*% bhalf - 2 * t(X[ind,]) %*% y[ind])
    v <- v * mu - eta * (1/k) * gradMini  # learning rate decays as 1/k over epochs
    b <- b + v
    err[j + (k-1) * numBatches] <- max(abs(b - betaOls))
  }
}
plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*numBatches, 0:E)
abline(v=(1:E)*numBatches)
```
Clearly, if we need to get close to the true solution in a short amount of time, SGD
with momentum is a great technique. The benefits are even more pronounced for the
non-convex loss functions encountered when training neural networks.