Stochastic Optimization Techniques

An introduction to several stochastic optimization techniques; a short NumPy sketch of each update rule follows at the end of this section.

  1. Stochastic Gradient Descent (SGD): SGD simply updates each parameter by subtracting the gradient of the loss with respect to that parameter, scaled by the learning rate \(\eta\), a hyperparameter. If \(\eta\) is too large, SGD will diverge; if it is too small, it will converge slowly. The update rule is simply \[ \theta_{t+1} = \theta_{t} - \eta \nabla L(\theta_{t}) \]
  2. Momentum: In SGD, the gradient \(\nabla L(\theta_{t})\) often changes rapidly at each iteration \(t\) because the loss is computed over different data. This is often partially mitigated by reusing the gradient value from the previous iteration, scaled by a momentum hyperparameter \(\mu\), as follows: \[ v_{t+1} = \mu v_{t} - \eta \nabla L(\theta_{t})\] \[ \theta_{t+1} = \theta_{t} + v_{t+1} \] It has been argued that including the previous gradient step has the effect of approximating some second-order information about the gradient.
  3. AdaGrad: AdaGrad effectively rescales the learning rate for each parameter according to the history of the gradients for that parameter. This is done by dividing each element of \(\nabla L\) by the square root of the sum of squares of its historical gradient. Rescaling in this way effectively lowers the learning rate for parameters which consistently have large gradient values. It also effectively decreases the learning rate over time, because the sum of squares continues to grow with each iteration. After initializing the rescaling term \(g_{0} = 0\), the updates are as follows: \[ g_{t+1} = g_{t} + \nabla L(\theta_{t})^{2} \] \[ \theta_{t+1} = \theta_{t} - \frac{\eta \nabla L(\theta_{t})}{\sqrt{g_{t+1}} + \epsilon} \] where the division is elementwise and \(\epsilon\) is a small constant included for numerical stability.
  4. RMSProp: RMSProp is very similar to AdaGrad. The only difference is that the \(g_{t}\) term is computed as an exponentially decaying average instead of an accumulated sum. This makes \(g_{t}\) an estimate of the second moment of \(\nabla L\) and prevents the effective learning rate from shrinking over time. The update is as follows: \[ g_{t+1} = \beta g_{t} + (1 - \beta)\nabla L(\theta_{t})^{2} \] \[ \theta_{t+1} = \theta_{t} - \frac{\eta \nabla L(\theta_{t})}{\sqrt{g_{t+1}} + \epsilon} \]

In the original lecture slides where it was proposed, \(\beta\) is set to 0.9. In , it is shown that the \(\sqrt{g_{t+1}}\) term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are \(N(0,1)\) distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems which may have many saddle points.
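
The four update rules above can be written out directly in NumPy. The sketch below is a minimal illustration, not a production implementation: it assumes a toy quadratic loss \(L(\theta) = \tfrac{1}{2}\|\theta - \text{target}\|^{2}\), and the target vector, iteration count, and hyperparameter values (\(\eta\), \(\mu\), \(\beta\), \(\epsilon\)) are placeholders chosen for the example rather than recommended settings.

```python
import numpy as np

# Toy objective: L(theta) = 0.5 * ||theta - target||^2, so grad L(theta) = theta - target.
# The target and all hyperparameter values below are illustrative placeholders.
target = np.array([3.0, -2.0])

def grad(theta):
    return theta - target

def sgd_step(theta, eta=0.1):
    # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta - eta * grad(theta)

def momentum_step(theta, v, eta=0.1, mu=0.9):
    # v_{t+1} = mu * v_t - eta * grad L(theta_t);  theta_{t+1} = theta_t + v_{t+1}
    v = mu * v - eta * grad(theta)
    return theta + v, v

def adagrad_step(theta, g, eta=0.1, eps=1e-8):
    # g_{t+1} = g_t + grad^2  (elementwise accumulated sum of squares)
    dL = grad(theta)
    g = g + dL ** 2
    return theta - eta * dL / (np.sqrt(g) + eps), g

def rmsprop_step(theta, g, eta=0.01, beta=0.9, eps=1e-8):
    # g_{t+1} = beta * g_t + (1 - beta) * grad^2  (exponentially decaying average)
    dL = grad(theta)
    g = beta * g + (1 - beta) * dL ** 2
    return theta - eta * dL / (np.sqrt(g) + eps), g

# Usage: run each optimizer from the same starting point on the toy loss.
theta_sgd = np.zeros(2)
theta_mom = np.zeros(2)
theta_ada = np.zeros(2)
theta_rms = np.zeros(2)
v = np.zeros(2)       # momentum velocity, initialized to zero
g_ada = np.zeros(2)   # AdaGrad accumulated sum of squared gradients
g_rms = np.zeros(2)   # RMSProp decaying average of squared gradients

for t in range(100):
    theta_sgd = sgd_step(theta_sgd)
    theta_mom, v = momentum_step(theta_mom, v)
    theta_ada, g_ada = adagrad_step(theta_ada, g_ada)
    theta_rms, g_rms = rmsprop_step(theta_rms, g_rms)

print("SGD:     ", theta_sgd)
print("Momentum:", theta_mom)
print("AdaGrad: ", theta_ada)
print("RMSProp: ", theta_rms)
```

Note that the four rules differ only in what per-parameter state they carry between iterations: none for SGD, a velocity vector for momentum, an accumulated sum of squared gradients for AdaGrad, and a decaying average of squared gradients for RMSProp.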