
Learning Rate decay

Learning rate decay is a technique used to adjust the learning rate during training in order to improve the convergence of the optimization algorithm. The idea behind learning rate decay is to gradually decrease the learning rate as training progresses, so that the optimization process can focus on the finer details of the loss function.

Learning rate decay methods:

Exponential decay: lr = lr_0 * e^(-d * t)
The learning rate is decreased exponentially over time, where the decay rate d and the initial learning rate lr_0 are hyperparameters. The learning rate drops quickly at the beginning and then more slowly as training goes on.

Time-based decay: lr = lr_0 / (1 + d * t)
The learning rate is decreased in inverse proportion to time, where the decay rate d and the initial learning rate lr_0 are hyperparameters. The learning rate decreases in a steady and gradual manner.

Step decay: lr = lr_0 * d^(epoch // epoch_drops)
The learning rate is decreased by a factor d every epoch_drops epochs, so it stays constant for a stretch of epochs and then drops in discrete steps.
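The sketch below shows the three schedules as plain Python functions; the hyperparameter values in the example loop (lr_0 = 0.1, d = 0.5, epoch_drops = 2) are illustrative assumptions, not values from the post.

```python
import math

def exponential_decay(lr_0, d, t):
    # lr = lr_0 * e^(-d * t): fast drop early, slower later
    return lr_0 * math.exp(-d * t)

def time_based_decay(lr_0, d, t):
    # lr = lr_0 / (1 + d * t): learning rate shrinks in inverse proportion to t
    return lr_0 / (1 + d * t)

def step_decay(lr_0, d, epoch, epoch_drops):
    # lr = lr_0 * d**(epoch // epoch_drops): constant plateaus, dropped by a factor d
    return lr_0 * d ** (epoch // epoch_drops)

# Print the three schedules for the first few epochs
for epoch in range(5):
    print(epoch,
          round(exponential_decay(0.1, 0.5, epoch), 4),
          round(time_based_decay(0.1, 0.5, epoch), 4),
          round(step_decay(0.1, 0.5, epoch, epoch_drops=2), 4))
```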

Adam

Adam (Adaptive Moment Estimation) is a popular optimization algorithm used in deep learning. It combines the benefits of both RMSprop and momentum to improve the convergence of the optimization process. The mechanism of Adam is similar to that of RMSprop, in that it computes an adaptive learning rate for each parameter, but it also incorporates momentum by keeping moving averages of both the gradients and the squared gradients. In addition, Adam includes a bias correction mechanism to compensate for the biased estimates of the first and second moments of the gradients.

m_t = beta1 * m_t-1 + (1 - beta1) * Grad(W)
v_t = beta2 * v_t-1 + (1 - beta2) * (Grad(W) ^ 2)
m_tnorm = m_t / (1 - (beta1 ^ t))
v_tnorm = v_t / (1 - (beta2 ^ t))
w_t = w_t-1 - alpha * m_tnorm / (sqrt(v_tnorm) + epsilon)

The algorithm can be summarized as follows: compute the gradient of the objective function with respect to the parameters; update the first and second moments of the gradients using exponential moving averages; apply the bias corrections; and update the parameters using the bias-corrected moments, as in the sketch below.
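A minimal sketch of the Adam update in Python/NumPy; the default hyperparameters (alpha, beta1, beta2, eps) and the toy objective in the usage example are illustrative assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w, given the gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                  # bias correction for the second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimize f(w) = ||w||^2, whose gradient is 2*w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.01)
print(w)  # both entries end up near 0
```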

RMSProp optimization

RMSprop is a widely used optimization algorithm for neural networks that helps accelerate the convergence of the gradient descent process. It was first introduced by Geoff Hinton in 2012. The main idea behind RMSprop is to normalize the gradients by dividing them by a running average of their magnitudes. This reduces the oscillations in the gradient descent process and makes the step sizes adaptive, i.e., larger when the gradient is small and smaller when the gradient is large.

The mechanism of RMSprop involves maintaining a moving average of the squared gradients for each weight. This moving average is updated at each iteration using a decay rate hyperparameter, typically set to a value between 0.9 and 0.99. The current gradient is then divided by the square root of the moving average, which effectively normalizes the gradient. With RMSprop and Adam you need to add a small epsilon to the denominator to avoid division by zero: divide the gradient by the square root of the moving average plus epsilon, and scale the result by the learning rate to get the weight update.
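A minimal sketch of the RMSprop update in Python/NumPy; the decay rate, learning rate and toy objective are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(w, grad, s, alpha=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update; s is the moving average of the squared gradients."""
    s = rho * s + (1 - rho) * grad ** 2           # moving average of squared gradients
    w = w - alpha * grad / (np.sqrt(s) + eps)     # normalized step; eps avoids division by zero
    return w, s

# Usage: minimize f(w) = ||w||^2, whose gradient is 2*w
w = np.array([3.0, -1.5])
s = np.zeros_like(w)
for _ in range(1000):
    w, s = rmsprop_step(w, 2 * w, s)
print(w)  # both entries end up near 0
```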

Saddle points, ravines and local optimum

In the context of optimization, a saddle point is a critical point of a function where the slopes (gradients) in every direction are zero, but the function curves upward (like a minimum) along some directions and downward (like a maximum) along others. A gradient descent algorithm may stall near a saddle point and fail to converge to the global optimum.

A ravine is a narrow valley in the optimization landscape where the surface is much steeper in some directions than in others. Gradients are large across the steep walls, so the algorithm can make rapid initial progress, but ravines are also problematic because the algorithm tends to overshoot and oscillate across the narrow direction, which slows convergence along the valley floor.

A local optimum is a point in the optimization landscape where the function has the lowest value in a local neighborhood, but other points elsewhere in the landscape may have lower values. Optimization algorithms like gradient descent can converge to local optima instead of the global optimum, which can lead to suboptimal solutions.
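As an illustrative sketch (the function f(x, y) = x**2 - y**2 is a standard textbook saddle, not an example from the post), plain gradient descent started almost exactly on the ridge through the saddle makes very slow progress before it finally escapes:

```python
# f(x, y) = x**2 - y**2 has zero gradient at (0, 0); it is a minimum along x
# and a maximum along y, so (0, 0) is a saddle point.
def grad(x, y):
    return 2 * x, -2 * y

x, y = 1.0, 1e-6   # start very close to the ridge through the saddle
lr = 0.1
for step in range(101):
    if step % 20 == 0:
        print(step, round(x, 6), round(y, 6))  # x collapses fast, y escapes only after many steps
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
```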

Gradient descent with momentum

Gradient descent with momentum is an optimization algorithm used to update the weights of a neural network during training. It is an extension of the standard gradient descent algorithm that adds a momentum term to the update rule. The momentum term helps accelerate the gradient vectors in the right directions, leading to faster convergence.

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of past gradients and use this average to update the weights instead of using only the current gradient. This smooths out fluctuations in the gradient and helps prevent oscillations in the optimization process.

How it works:
v_t = beta * v_t-1 + alpha * Grad(w)
w = w - v_t

Pros:
- It converges much faster than standard gradient descent.
- Less oscillation around ravines and local optima.

Cons:
- It requires tuning of an additional hyperparameter (momentum), which can be time-consuming.
- Can cause the optimizer to overshoot the minimum if the momentum term is set too high.
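A minimal sketch of this update rule in Python/NumPy; the learning rate, momentum value and toy objective are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, v, alpha=0.01, beta=0.9):
    """One momentum update, matching v_t = beta * v_{t-1} + alpha * grad; w = w - v_t."""
    v = beta * v + alpha * grad   # exponentially weighted accumulation of past gradients
    w = w - v                     # step along the accumulated velocity
    return w, v

# Usage: minimize f(w) = ||w||^2, whose gradient is 2*w
w = np.array([2.0, -3.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # both entries end up near 0
```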

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a variation of the gradient descent algorithm that is commonly used in deep learning. It is often referred to as Stochastic Gradient Descent (SGD), although technically SGD refers to the version of the algorithm that uses a batch size of 1.

To implement Mini-Batch Gradient Descent, we first randomly shuffle the training data and then divide it into batches of a fixed size. We then loop over the batches and perform the following steps for each batch (see the sketch below):
- Compute the gradients of the loss function with respect to the model parameters using the current batch of data.
- Update the model parameters using the computed gradients and the learning rate.
- Repeat until the entire dataset has been processed a fixed number of times (epochs).

Mechanism:
- It updates the model in small batches instead of one large batch.
- It reduces the variance of the parameter updates compared with single-sample SGD, which can lead to more stable convergence.
- It can make use of highly optimized matrix operations, so the gradient of a whole batch is computed efficiently.
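A minimal sketch of a mini-batch training loop in Python/NumPy; the linear-regression model, the synthetic data and the hyperparameters are illustrative assumptions, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + noise
X = rng.normal(size=(1000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    idx = rng.permutation(len(X))                  # shuffle the training data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]      # indices of the current mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient of the MSE loss on this batch
        w -= lr * grad                             # parameter update
print(w)  # close to w_true
```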

Batch normalization

Batch normalization (BN) is a method used to make the training of artificial neural networks faster and more stable by normalizing the layers' inputs, re-centering and re-scaling them. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

Mechanism:
- It normalizes the activations of each layer for every mini-batch independently.
- It adds two trainable parameters, a scale (gamma) and a shift (beta), which rescale the normalized output.
- It introduces some noise to the output of each layer.

Pros:
- It helps improve the performance of machine learning models by making training faster and more stable.
- It helps reduce overfitting.
- It helps reduce the dependence on initialization.
- Works well with other optimization methods (SGD with momentum, RMSprop, Adam).
- Allows for higher learning rates.

Cons:
- It can be computationally expensive.
- It can cause problems when used with small batch sizes.
- It is not well suited for RNNs / LSTMs.
- A different calculation has to be used during training and testing (batch statistics while training, running averages at inference).
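A minimal sketch of the training-mode batch-norm forward pass for a fully connected layer in Python/NumPy; the input shape, gamma, beta and eps values are illustrative assumptions. At test time, running averages of the batch statistics would replace mu and var, which is the training/testing difference noted above.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) activation matrix per feature, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # re-center and re-scale
    return gamma * x_hat + beta              # trainable scale (gamma) and shift (beta)

# Usage
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))     # roughly 0 and 1 per feature
```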

Feature Scaling

Feature scaling is a data preprocessing technique that transforms the values of features or variables in a dataset to a similar scale. This is done to ensure that all features contribute equally to the model and to prevent features with larger values from dominating it. Feature scaling is not strictly necessary for all machine learning models.

There are two main feature scaling techniques: the min-max scaler and the standard scaler. The min-max scaler works well for features whose distributions are not Gaussian, while the standard scaler works well for features with Gaussian distributions.

Rescaling / min-max normalization: x_p = (x - min(x)) / (max(x) - min(x))
Standardization / Z-score normalization: x_p = (x - m) / s, where m is the mean and s the standard deviation of the feature.

Pros:
- It helps improve the performance of machine learning models by ensuring that all features contribute equally to the model.
- It helps prevent features with larger values from dominating the model.
- It helps improve the convergence rate of gradient-based optimization algorithms.
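A minimal sketch of both scalers in Python/NumPy; the toy feature matrix is an illustrative assumption.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) to the [0, 1] range."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Z-score each feature: subtract its mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Usage: two features on very different scales
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 7000.0]])
print(min_max_scale(X))   # every column now lies in [0, 1]
print(standardize(X))     # every column now has mean 0 and std 1
```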