Optimizers


Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.


We’ll learn about different types of optimizers, along with their advantages and disadvantages:


Gradient Descent

Gradient Descent is the most basic yet most widely used optimization algorithm. It’s used heavily in linear regression and classification algorithms, and backpropagation in neural networks also relies on a gradient descent algorithm.


Gradient descent is a first-order optimization algorithm, meaning it depends on the first-order derivative of the loss function. It calculates in which direction the weights should be altered so that the loss function can reach a minimum. Through backpropagation, the loss is propagated from one layer to another, and the model’s parameters (also known as weights) are adjusted according to the gradient so that the loss is minimized.
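As a minimal sketch of this update rule (assuming a made-up toy linear-regression problem in NumPy, not code from the original article), one full-batch gradient descent step looks like this:

```python
import numpy as np

# Hypothetical toy data: a linear-regression problem y = Xw with known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])

w = np.zeros(3)          # model parameters (weights)
lr = 0.1                 # learning rate (assumed value)

for epoch in range(100):
    # Gradient of the mean-squared-error loss computed over the WHOLE dataset.
    grad = 2 * X.T @ (X @ w - y) / len(X)
    # Move the weights against the gradient to reduce the loss.
    w -= lr * grad
```

Note that a single update requires a pass over the entire dataset, which is exactly the memory and speed limitation listed among the disadvantages below.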


Advantages:

  1. Easy computation.
  2. Easy to implement.
  3. Easy to understand.

Disadvantages:


  1. May get trapped at a local minimum.
  2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is very large, convergence to the minimum can be extremely slow.
  3. Requires a large amount of memory to calculate the gradient on the whole dataset.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of Gradient Descent that updates the model’s parameters more frequently: the parameters are altered after computing the loss on each individual training example. So, if the dataset contains 1000 rows, SGD updates the model parameters 1000 times in one pass over the dataset, instead of once as in Gradient Descent.
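A minimal sketch of this per-example update, again on an assumed toy NumPy linear-regression problem (the data and learning rate are illustrative, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 rows, as in the text
y = X @ np.array([1.5, -2.0, 0.5])

w = np.zeros(3)
lr = 0.01                                 # assumed learning rate

for epoch in range(10):
    for i in rng.permutation(len(X)):     # visit each example in random order
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ w - yi)     # gradient from ONE training example
        w -= lr * grad                    # 1000 parameter updates per epoch
```

Because each update is based on a single example, the steps are noisy, which is the source of the high variance mentioned in the disadvantages below.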


Advantages:

  1. Frequent updates of the model parameters, hence it converges in less time.
  2. Requires less memory, as there is no need to store loss values for the whole dataset.
  3. May find new minima.

Disadvantages:

  1. High variance in the model parameters.
  2. May overshoot even after reaching the global minimum.
  3. To get the same convergence as Gradient Descent, the learning rate needs to be reduced slowly.

Mini-Batch Gradient Descent

It’s the best among all the variations of gradient descent, improving on both SGD and standard Gradient Descent. It updates the model parameters after every batch: the dataset is divided into several batches, and the parameters are updated after each one.
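A sketch of the batching loop, assuming the same toy NumPy setup as above and an illustrative batch size of 32:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5])

w = np.zeros(3)
lr = 0.05                                  # assumed learning rate
batch_size = 32                            # assumed batch size

for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle the dataset each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient on one batch
        w -= lr * grad                              # update after every batch
```

Averaging the gradient over a batch smooths out the per-example noise of SGD while still giving many updates per epoch, which is why its variance sits between SGD and full-batch Gradient Descent.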


Advantages:

  1. Frequently updates the model parameters while having lower variance than SGD.
  2. Requires a medium amount of memory.

All types of Gradient Descent have some challenges:

  1. Choosing an optimal value for the learning rate. If the learning rate is too small, gradient descent may take a very long time to converge.
  2. Having a constant learning rate for all parameters. There may be some parameters that we do not want to change at the same rate.
  3. May get trapped at a local minimum.

Adam

Adam (Adaptive Moment Estimation) works with first- and second-order moments of the gradient. The intuition behind Adam is that we don’t want to roll so fast just because we can jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients.
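A sketch of a single Adam step, following the standard published update rule with its usual default hyperparameters (the function name adam_update and the toy usage loop are just for illustration):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: adaptive per-parameter learning rates from two moments."""
    m = beta1 * m + (1 - beta1) * grad           # decaying average of past gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad**2        # decaying average of past squared gradients (2nd moment)
    m_hat = m / (1 - beta1**t)                   # bias correction for the early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Usage on a toy linear-regression problem (assumed data, as in the earlier sketches).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 201):                          # t starts at 1 for the bias correction
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w, m, v = adam_update(w, grad, m, v, t)
```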


Advantages:

  1. The method is very fast and converges rapidly.
  2. Rectifies the vanishing learning rate and high variance problems.

Disadvantages:

  1. Computationally costly.


