# Optimization Algorithms for Gradient Descent

Gradient descent is one of the most popular and widely used optimization algorithms in machine learning; one can argue it is the de facto standard.
In the machine learning setting, gradient descent minimizes an objective function (more commonly known as the loss function) $J(\theta)$ with respect to the model parameters $\theta$ by updating the parameters in the opposite direction of the gradient of the objective. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum.
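The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the toy objective $J(\theta) = (\theta - 3)^2$ and all variable names are chosen here for demonstration.

```python
import numpy as np

def gradient_descent(grad, theta, eta=0.1, n_steps=100):
    """Plain gradient descent: repeatedly step against the gradient of J."""
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)  # theta <- theta - eta * dJ/dtheta
    return theta

# Toy objective J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda th: 2 * (th - 3.0), theta=0.0)
```

With a well-chosen $\eta$ this converges to the minimizer at $\theta = 3$; the bullet points below describe what goes wrong when $\eta$ is not well chosen.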

• Determining an appropriate learning rate can be difficult
• Too high – the algorithm may never converge to a stable value of the loss or the parameter weights
• Too low – the algorithm may take too long to converge
• The learning scheme doesn't allow for dynamic learning
• The learning rate cannot be adjusted during training based on changes in the objective function; schedules that attempt this require a predefined set of learning rate values to be applied at each step.
• The learning rate is static across all parameters, i.e., every parameter update takes the same jump step. This is particularly limiting because parameters differ in update frequency and range, so we would want updates to happen at different speeds for different parameters.

### Some off-the-shelf remedies

#### Momentum

The momentum algorithm addresses the static-step problem of gradient descent: it helps accelerate GD in the relevant direction and dampens oscillations.


Momentum works by adding a fraction $\gamma$ of the update vector of the past time step to the current update:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)$

$\theta = \theta - v_t$

The momentum term γ is usually set to 0.9. The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
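The two update equations translate directly into code. The sketch below uses the customary $\gamma = 0.9$ from the text; the learning rate, step count, and toy quadratic objective are illustrative assumptions.

```python
def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One momentum update: v_t = gamma * v_{t-1} + eta * grad(theta); theta -= v_t."""
    v = gamma * v + eta * grad(theta)
    theta = theta - v
    return theta, v

# Minimize J(theta) = theta^2 (gradient 2 * theta), starting from theta = 5.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, lambda th: 2 * th)
```

Because consecutive gradients on this quadratic point the same way for most of the descent, the velocity `v` builds up and the iterate reaches the minimum faster than plain gradient descent with the same $\eta$.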

#### RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class. RMSprop divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.

$G_j^t = \alpha G_j^{t-1} + (1-\alpha)g_{tj}^2$

$\theta_j^t = \theta_j^{t-1} - \frac{\eta_t}{\sqrt{G_j^t} + \epsilon}g_{tj}$

The parameter $\epsilon$ is a very small constant that avoids division-by-zero errors and is usually set to $10^{-8}$.
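A per-parameter sketch of these two equations follows. The decay rate $\alpha = 0.9$, the learning rate, and the two-parameter toy problem (one steep direction, one shallow) are illustrative assumptions, chosen to show the effect the text describes: dividing by $\sqrt{G}$ equalizes the effective step sizes across parameters with very different gradient magnitudes.

```python
import numpy as np

def rmsprop_step(theta, G, grad, eta=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop update using a running average of squared gradients."""
    g = grad(theta)
    G = alpha * G + (1 - alpha) * g**2              # G_t = alpha * G_{t-1} + (1 - alpha) * g_t^2
    theta = theta - eta / (np.sqrt(G) + eps) * g    # per-parameter scaled step
    return theta, G

# Two parameters with wildly different gradient scales.
theta = np.array([5.0, 5.0])
G = np.zeros(2)
grad = lambda th: np.array([200.0 * th[0], 0.02 * th[1]])  # steep vs. shallow
for _ in range(2000):
    theta, G = rmsprop_step(theta, G, grad)
```

Plain gradient descent would need a tiny $\eta$ to keep the steep coordinate stable, starving the shallow one; RMSprop moves both toward the minimum at comparable speeds.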

#### Adam

Adam (Adaptive Moment Estimation) combines the two ideas above: in addition to keeping an exponentially decaying average of past squared gradients $v_t$ like RMSprop, it also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface. We compute the decaying averages of past and past squared gradients
$m_t$ and $v_t$ respectively as follows:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2){g_t}^2$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of zeros, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e., $\beta_1$ and $\beta_2$ are close to 1). These biases are counteracted by computing bias-corrected first and second moment estimates:

$\hat{m_t} = \frac{m_t}{1 - \beta_1^t}$

$\hat{v_t} = \frac{v_t}{1 - \beta_2^t}$

The Adam update rule is as follows:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v_t}} + \epsilon}\hat{m_t}$

The parameter $\epsilon$ is a very small constant that avoids division-by-zero errors. The algorithm usually takes the default values of 0.9 for
$\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
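Putting the moment estimates, the bias correction, and the update rule together gives the sketch below. It uses the default hyperparameters quoted above; the toy objective, step count, and variable names are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, m, v, g, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = b1 * m + (1 - b1) * g             # m_t: decaying average of gradients
    v = b2 * v + (1 - b2) * g**2          # v_t: decaying average of squared gradients
    m_hat = m / (1 - b1**t)               # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize J(theta) = (theta - 1)^2, whose gradient is 2 * (theta - 1).
theta, m, v = 2.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2 * (theta - 1.0)
    theta, m, v = adam_step(theta, m, v, g, t)
```

Note that the time step `t` starts at 1: with `t = 0` the bias-correction denominators $1 - \beta^t$ would be zero, which is why the correction terms use powers of the decay rates at the current step.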