Momentum

Generate data

To illustrate the use of momentum in optimization algorithms, let us try to use it to solve for the ordinary least squares solution in linear regression. I will first generate a small test set of data:

set.seed(42)
n <- 1000
p <- 10
X <- matrix(rnorm(n*p),ncol=p)
beta <- 1 / 1:p
y <- X %*% beta + rnorm(n, sd=0.5)

The true solution, which in this case we can calculate analytically, is given by directly solving the normal equations. We can get these in R by the following:

betaOls <- coef(lm(y ~ X - 1))
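As a sanity check, the normal equations (X'X) b = X'y can also be solved explicitly with solve(); the result should agree with the lm fit to within numerical precision. A minimal sketch, regenerating the same simulated data:

```r
# Sanity check: solve the normal equations (X'X) b = X'y directly
set.seed(42)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
y <- X %*% (1 / 1:p) + rnorm(n, sd = 0.5)

betaNormal <- solve(t(X) %*% X, t(X) %*% y)  # explicit normal-equations solve
betaOls <- coef(lm(y ~ X - 1))               # lm's QR-based solution
max(abs(betaNormal - betaOls))               # agreement to machine precision
```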

Standard gradient descent

Standard gradient descent calculates the entire gradient function and moves some amount in the opposite direction of the gradient. Here we run this algorithm for 100 iterations and plot the error on the log scale:

eta <- 0.01
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  v <- -1 * eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

Notice that the log of the error is decreasing almost exactly linearly in the iteration number; that is, the error itself decays geometrically. We see that the error is still relatively high after 100 iterations.
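The geometric decay is no accident: for this quadratic objective, each gradient step multiplies the error b - betaOls by the matrix I - eta * (2/n) X'X, so the asymptotic per-iteration contraction factor is roughly the spectral radius of that matrix. A sketch computing it for the same simulated data:

```r
# Per-iteration contraction factor of gradient descent on the quadratic objective
set.seed(42)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
eta <- 0.01

lam <- eigen((2 / n) * t(X) %*% X, symmetric = TRUE)$values
rho <- max(abs(1 - eta * lam))  # error multiplier per iteration
rho                             # just below 1, so convergence is slow but steady
```

With rho near 0.98, after 100 iterations the error has shrunk by only a factor of roughly five, matching the plot.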

Newton’s method

Newton’s method uses a quadratic approximation to the function to estimate not just which direction to move in but how far to move in that direction. It is often augmented with a learning rate, though this plays a quite different role than the rate in gradient descent, because in theory Newton's method should converge under fairly weak assumptions even with a learning rate of 1.

gamma <- 0.5
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  hessFull <- (2/n) * t(X) %*% X
  v <- -gamma * solve(hessFull) %*% gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

Because the linear regression objective is itself a quadratic form, a learning rate of 1 would solve the problem exactly in a single step. We see that even setting it to 0.5 converges to double-precision accuracy within 50 steps.
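To see the one-step claim concretely, here is a single Newton step with a learning rate of 1, starting from zero; the update lands exactly on the OLS solution. A minimal sketch, regenerating the data:

```r
# One full Newton step solves the quadratic problem exactly
set.seed(42)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
y <- X %*% (1 / 1:p) + rnorm(n, sd = 0.5)
betaOls <- coef(lm(y ~ X - 1))

b <- rep(0, p)
grad <- (2 / n) * (t(X) %*% X %*% b - t(X) %*% y)
hess <- (2 / n) * t(X) %*% X
b <- b - solve(hess) %*% grad  # one Newton step with learning rate 1
max(abs(b - betaOls))          # essentially machine epsilon
```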

Basic momentum

What if we approximate the Hessian information by using basic momentum, with a mu of 0.8?

eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  v <- v * mu - eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

Though slightly noisier at some points along the curve, this gives a drastically better convergence rate!

Nesterov’s accelerated gradient (i.e., momentum)

Finally, we’ll use Nesterov’s trick: evaluate the gradient at the look-ahead point b + mu * v rather than at b itself:

eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)
for (k in 1:E) {
  bhalf <- b + v * mu
  gradFull <- (2/n) * ((t(X) %*% X) %*% bhalf - t(X) %*% y)
  v <- v * mu - eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

The benefit here is slight, but still noticeable. For noisier functions, where the gradient is changing rapidly, the differences will become more pronounced.

Example of SGD

Finally, let’s put this all together and use Nesterov momentum within SGD, with a learning rate that decays as 1/epoch:

eta <- 0.01
b <- rep(0, p)
m <- 5
E <- 10
mu <- 0.8
v <- 0
nBatch <- n / m
err <- rep(NA, nBatch * E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n), ncol=m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j,]
    bhalf <- b + v * mu
    gradMini <- (2/m) * ((t(X[ind,]) %*% X[ind,]) %*% bhalf - t(X[ind,]) %*% y[ind])
    v <- v * mu - eta * (1 / k) * gradMini
    b <- b + v
    err[j + (k-1) * nrow(miniBatchIndex)] <- max(abs(b - betaOls))
  }
}

plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*nrow(miniBatchIndex), 0:E)
abline(v=(1:E)*nrow(miniBatchIndex))