RMSProp optimization

RMSprop (Root Mean Square propagation) is a widely used optimization algorithm for neural networks that helps to accelerate the convergence of gradient descent. It was first introduced by Geoff Hinton in his 2012 Coursera course on neural networks, rather than in a published paper.

The main idea behind RMSprop is to normalize the gradients by dividing them by a running average of their magnitudes. This reduces the oscillations of gradient descent and makes the step sizes adaptive: the effective step is larger for parameters whose recent gradients have been small and smaller for parameters whose recent gradients have been large.

The mechanism of RMSprop involves maintaining a moving average of the squared gradients for each weight. This moving average is updated at each iteration using a decay rate hyperparameter, typically set to a value between 0.9 and 0.99. The current gradient is then divided by the square root of the moving average, which effectively normalizes the gradient.

With RMSprop (and likewise with Adam), a small epsilon is added to the denominator to avoid division by zero.

The update rule, and the source of the name: divide the gradient by the square Root of the Mean of the Squared gradient, propagated as a moving average:

s_t = beta * s_{t-1} + (1 - beta) * square(Grad(w))

w_t = w_{t-1} - alpha * Grad(w) / (sqrt(s_t) + epsilon)
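
As a quick illustration, here is a minimal sketch of a single RMSprop step on a toy two-parameter vector, with made-up gradient values, showing how the large-gradient coordinate and the small-gradient coordinate end up moving by roughly the same amount:

import numpy as np

w = np.array([1.0, 1.0])
grad = np.array([10.0, 0.1])   # made-up gradients: one large, one small
s = np.zeros_like(w)           # moving average of the squared gradients
alpha, beta, epsilon = 0.01, 0.9, 1e-8

s = beta * s + (1 - beta) * grad ** 2          # s becomes [10.0, 0.001]
w = w - alpha * grad / (np.sqrt(s) + epsilon)  # both coordinates move by about 0.0316
print(w)  # roughly [0.968, 0.968]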

Pros:

  • RMSprop adapts the learning rate for each parameter based on the history of its gradients, which can help the algorithm converge faster.
  • In practice it is relatively robust to the choice of the decay rate and works well with default settings.
  • It helps to deal with ravines and saddle points, where plain gradient descent tends to oscillate or stall.
  • It is computationally efficient and scales well to large datasets and deep neural networks.

Cons:

  • It can be sensitive to the choice of the initial learning rate and other hyperparameters, which can affect the convergence and performance of the algorithm.
  • RMSprop can sometimes get stuck in a local minimum or a plateau, especially in deep neural networks with many parameters.
  • The algorithm can be prone to overfitting if the regularization is not applied properly.

Code with numpy:

import numpy as np

def update_variables_RMSProp(alpha, beta2, epsilon, var, grad, s):
    """
    @alpha is the learning rate
    @beta2 is the RMSProp weight
    @epsilon is a small number to avoid division by zero
    @var is a numpy.ndarray containing the variable to be updated
    @grad is a numpy.ndarray containing the gradient of var
    @s is the previous second moment of var
    Returns: the updated variable and the new moment, respectively
    """
    # update the moving average of the squared gradient
    s = beta2 * s + (1 - beta2) * (grad ** 2)
    # scale the step by the root of that average (updates var in place)
    var -= alpha * grad / (np.sqrt(s) + epsilon)
    return var, s
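
For example, inside a training loop the function could be used like this (a sketch that reuses the numpy import and the function above; compute_gradient is a hypothetical stand-in for whatever computes the gradient of the loss with respect to W, and the hyperparameter values are just typical defaults):

W = np.random.randn(3, 2)   # hypothetical weight matrix
s = np.zeros_like(W)        # second moment starts at zero
for step in range(1000):
    dW = compute_gradient(W)   # hypothetical: gradient of the loss w.r.t. W
    W, s = update_variables_RMSProp(0.001, 0.9, 1e-8, W, dW, s)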

Code with tensorflow:

import tensorflow as tf

def create_RMSProp_op(loss, alpha, beta2, epsilon):
    """
    loss is the loss of the network
    alpha is the learning rate
    beta2 is the RMSProp weight
    epsilon is a small number to avoid division by zero
    Returns: the RMSProp optimization operation
    """
    optimizer = tf.train.RMSPropOptimizer(learning_rate=alpha,
                                          decay=beta2,
                                          epsilon=epsilon)
    train_op = optimizer.minimize(loss)
    return train_op
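
Note that tf.train.RMSPropOptimizer is the TensorFlow 1.x API. In TensorFlow 2.x the same optimizer is exposed through Keras; a minimal sketch, reusing the tensorflow import above and assuming a Keras model object already exists:

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001,
                                        rho=0.9,      # plays the role of beta2 / decay
                                        epsilon=1e-8)
model.compile(optimizer=optimizer, loss='mse')  # model is a hypothetical tf.keras model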
