Batch normalization

Batch normalization (BN) is a method used to make the training of artificial neural networks faster and more stable by normalizing the layers' inputs, i.e. re-centering and re-scaling them. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

Mechanism:

  • It normalizes each layer's inputs per feature, using the mean and variance of the current mini-batch (see the transform below).
  • It adds two trainable parameters per normalized feature, gamma and beta, which scale and shift the normalized output.
  • Because the statistics change from one mini-batch to the next, it introduces some noise into the output of each layer, which acts as a mild regularizer.
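
Concretely, for an input x over a mini-batch B with mean μ_B and variance σ²_B, the transform from the original paper is:

x̂ = (x − μ_B) / √(σ²_B + ε)
y = γ · x̂ + β

where ε is a small constant added to avoid division by zero, and γ and β are the trainable scale and offset.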

Pros:

  • It helps improve the performance of machine learning models by making training faster and more stable.
  • It helps reduce overfitting.
  • It helps reduce the dependence on initialization.
  • It works well with other optimization methods (SGD with momentum, RMSprop, Adam).
  • It allows higher learning rates to be used.

Cons:

  • It adds some computational overhead at both training and inference time.
  • It can cause problems when used with small batch sizes, because the mini-batch statistics become noisy estimates.
  • It does not work well with recurrent networks such as RNNs and LSTMs.
  • It requires a different calculation at training time (mini-batch statistics) and at test time (running averages of those statistics); see the sketch after this list.
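
To make the last point concrete, here is a minimal numpy sketch of the two modes. The names running_mean, running_var and momentum are illustrative choices, not something defined elsewhere in this post: training normalizes with the mini-batch statistics and updates exponential moving averages, while testing reuses those averages.

import numpy as np

def batch_norm_train(Z, gamma, beta, running_mean, running_var,
                     momentum=0.9, epsilon=1e-8):
    """Training mode: normalize with the mini-batch statistics and
    update the running averages that will be used at test time."""
    mean = np.mean(Z, axis=0)
    var = np.var(Z, axis=0)
    # exponential moving averages of the batch statistics
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    Z_norm = (Z - mean) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta, running_mean, running_var

def batch_norm_test(Z, gamma, beta, running_mean, running_var, epsilon=1e-8):
    """Test mode: reuse the running averages, no mini-batch statistics."""
    Z_norm = (Z - running_mean) / np.sqrt(running_var + epsilon)
    return gamma * Z_norm + beta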

Examples:

  • In a convolutional neural network (CNN), batch normalization can be applied after each convolutional layer.
  • In a deep neural network (DNN), batch normalization can be applied after each fully connected layer (a Keras sketch follows this list).
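
As a sketch of where the layer usually goes in practice, here is a small Keras model. This assumes TensorFlow 2.x / tf.keras; the layer sizes and input shape are just illustrative. The bias is disabled on the layers that feed into batch normalization because beta already plays that role.

import tensorflow as tf

model = tf.keras.Sequential([
    # convolutional block: Conv -> BN -> activation
    tf.keras.layers.Conv2D(32, 3, use_bias=False, input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Flatten(),
    # fully connected block: Dense -> BN -> activation
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
])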

Code using numpy:

import numpy as np


def batch_norm(Z, gamma, beta, epsilon):
    """
    @Z is a numpy.ndarray of shape (m, n) that should be normalized
        @m is the number of data points
        @n is the number of features in Z
    @gamma is a numpy.ndarray of shape (1, n)
    containing the scales used for batch normalization
    @beta is a numpy.ndarray of shape (1, n)
    containing the offsets used for batch normalization
    @epsilon is a small number used to avoid division by zero
    Returns: the normalized Z matrix
    """
    # Calculate the mean and variance of Z
    mean = np.mean(Z, axis=0)
    var = np.var(Z, axis=0)

    # Normalize Z
    Z_norm = (Z - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized Z using gamma and beta
    Z_norm_scaled_shifted = gamma * Z_norm + beta

    return Z_norm_scaled_shifted
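
A quick sanity check of the function above on random data (the shapes and seed are just illustrative): with gamma set to ones and beta set to zeros, every feature of the output should have a mean close to 0 and a standard deviation close to 1.

import numpy as np

np.random.seed(0)
Z = np.random.randn(32, 4) * 10 + 5   # mini-batch of 32 samples, 4 features
gamma = np.ones((1, 4))
beta = np.zeros((1, 4))

Z_out = batch_norm(Z, gamma, beta, 1e-8)
print(Z_out.mean(axis=0))   # ~0 for every feature
print(Z_out.std(axis=0))    # ~1 for every feature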

Code using tensorflow:

# Note: this uses the TensorFlow 1.x API (tf.contrib, tf.layers)
import tensorflow as tf


def create_batch_norm_layer(prev, n, activation):
    """
    @prev is the activated output of the previous layer
    @n is the number of nodes in the layer to be created
    @activation is the activation function that
    should be used on the output of the layer
    Returns: a tensor of the activated output for the layer
    """
    # Initialize the base layer with 'Dense' function from tensorflow
    # The 'prev' input is passed through a dense layer with n nodes
    k_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
    layer = tf.layers.Dense(units=n,
                            kernel_initializer=k_init,
                            use_bias=False)(prev)

    # Initialize trainable parameters 'gamma' and 'beta' as vectors of 1
    # and 0 respectively, with shape (1, n)
    gamma = tf.Variable(initial_value=tf.ones(shape=(1, n)), name="gamma")
    beta = tf.Variable(initial_value=tf.zeros(shape=(1, n)), name="beta")

    # Calculate the batch mean and variance of the previous layer
    # using the tf.nn.moments() function
    mean, variance = tf.nn.moments(layer, axes=[0])

    # Create a batch normalization layer using the
    # tf.nn.batch_normalization() function
    # with gamma, beta, mean, variance and epsilon=1e-8 as arguments
    # Apply the activation function to
    # the output of the batch normalization layer
    # The final result is a tensor of the activated output for the layer
    layer_norm = tf.nn.batch_normalization(layer,
                                           mean,
                                           variance,
                                           beta,
                                           gamma,
                                           1e-8)
    output = activation(layer_norm)
    return output
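
One possible way to stack such layers in a TensorFlow 1.x graph (the function above relies on tf.contrib and tf.layers, so it needs TF 1.x; the placeholder shape and layer sizes below are just examples):

x = tf.placeholder(tf.float32, shape=(None, 784))
a1 = create_batch_norm_layer(x, 256, tf.nn.tanh)
a2 = create_batch_norm_layer(a1, 64, tf.nn.tanh)
y_pred = create_batch_norm_layer(a2, 10, tf.nn.softmax)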

