Adam

Adam (Adaptive Moment Estimation) is a popular optimization algorithm used in deep learning. It combines the benefits of RMSprop and gradient descent with momentum to speed up convergence.

Like RMSprop, Adam computes an adaptive learning rate for each parameter from a moving average of the squared gradients, and like momentum, it smooths the update direction with a moving average of the gradients themselves. In addition, Adam applies a bias correction to both moving averages, which are initialized at zero and would otherwise be biased toward zero during the first steps.

m_t = beta1 * m_t-1 + (1 - beta1) * Grad(W)
v_t = beta2 * v_t-1 + (1 - beta2) * (Grad(W) ^ 2)

m_tnorm = m_t / (1 - beta1 ^ t)
v_tnorm = v_t / (1 - beta2 ^ t)

w_t = w_t-1 - alpha * m_tnorm / (sqrt(v_tnorm) + epsilon)
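
To see why the bias correction matters, take the first step (t = 1) with beta1 = 0.9 and a gradient of 1.0: the raw first moment is m_1 = 0.1, but the corrected value m_1 / (1 - 0.9 ^ 1) = 1.0 recovers the true gradient scale. Without the correction, the earliest updates would be shrunk by roughly a factor of ten.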

The Adam algorithm can be summarized as follows:

  1. Compute the gradient of the objective function with respect to the parameters.
  2. Compute the first and second moments of the gradients using exponential moving averages.
  3. Apply bias correction to the first and second moment estimates.
  4. Update the parameters using the corrected estimates and a learning rate.

Pros:

  • Efficient use of memory.
  • Computationally efficient.
  • Adaptive learning rate.
  • Robust to noisy data and sparse gradients.
  • Well-suited for large datasets and high-dimensional parameter spaces.
  • Good at navigating ravines and saddle points.

Cons:

  • Can overfit to noisy gradients.
  • Sensitive to initial learning rate and hyperparameters.
  • May converge to suboptimal solutions if the objective function is not smooth or has narrow valleys.
  • May require additional tuning compared to other optimization algorithms.

Code with numpy:

import numpy as np


def update_variables_Adam(alpha, beta1, beta2, epsilon, var, grad, v, s, t):
    """
    @alpha is the learning rate
    @beta1 is the weight used for the first moment
    @beta2 is the weight used for the second moment
    @epsilon is a small number to avoid division by zero
    @var is a numpy.ndarray containing the variable to be updated
    @grad is a numpy.ndarray containing the gradient of var
    @v is the previous first moment of var
    @s is the previous second moment of var
    @t is the time step used for bias correction
    Returns: the updated variable, the new first moment,
    and the new second moment, respectively
    """

    # Update biased first moment estimate
    v = beta1 * v + (1 - beta1) * grad

    # Update biased second moment estimate
    s = beta2 * s + (1 - beta2) * grad**2

    # Correct bias in first moment
    v_corrected = v / (1 - beta1**t)

    # Correct bias in second moment
    s_corrected = s / (1 - beta2**t)

    # Update variable
    updated_var = var - alpha * v_corrected / (np.sqrt(s_corrected) +
                                               epsilon)

    return updated_var, v, s
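
A minimal usage sketch of the function above, minimizing the toy objective f(w) = (w - 3)^2 (the objective, starting point, and hyperparameter values here are illustrative, not from the original post):

# Toy example: minimize f(w) = (w - 3)^2 with the Adam update above
w = np.array([0.0])          # parameter to optimize
v = np.zeros_like(w)         # first moment, initialized to zero
s = np.zeros_like(w)         # second moment, initialized to zero

for t in range(1, 1001):
    grad = 2 * (w - 3)       # gradient of (w - 3)^2
    w, v, s = update_variables_Adam(0.1, 0.9, 0.999, 1e-8, w, grad, v, s, t)

print(w)  # converges close to [3.]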

Code with tensorflow:

import tensorflow as tf  # assumes TensorFlow 1.x (tf.train.AdamOptimizer)


def create_Adam_op(loss, alpha, beta1, beta2, epsilon):
    """
    @loss is the loss of the network
    @alpha is the learning rate
    @beta1 is the weight used for the first moment
    @beta2 is the weight used for the second moment
    @epsilon is a small number to avoid division by zero
    Returns: the Adam optimization operation
    """
    # Create global step variable
    global_step = tf.Variable(0, trainable=False)

    # Define Adam optimizer with given hyperparameters
    optimizer = tf.train.AdamOptimizer(learning_rate=alpha, beta1=beta1,
                                       beta2=beta2, epsilon=epsilon)

    # Compute gradients and apply them using the Adam optimizer
    grads_and_vars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(grads_and_vars,
                                         global_step=global_step)

    return train_op
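
A usage sketch in a TensorFlow 1.x session, assuming a graph with a loss tensor and input placeholders x and y already built (those names, and the batch variables, are illustrative):

train_op = create_Adam_op(loss, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op, feed_dict={x: X_batch, y: Y_batch})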

