Adam Optimizer in TensorFlow
The Adam optimizer in TensorFlow is an advanced optimization algorithm widely used in deep learning models. It stands for Adaptive Moment Estimation and combines the advantages of both RMSprop and AdaGrad algorithms. Adam adaptively adjusts learning rates for each parameter using first and second-order moments of gradients, making it highly effective for training neural networks.
How Adam Optimizer Works
Adam optimizer uses an iterative approach that maintains two moving averages:
First moment (m_t): Exponentially decaying average of past gradients (momentum)
Second moment (v_t): Exponentially decaying average of past squared gradients (adaptive learning rate)
Algorithm Steps
The Adam optimizer follows these key steps −
Calculate gradients of the loss function with respect to parameters
Update first moment (mean) and second moment (uncentered variance) estimates
Apply bias correction to both moments
Update parameters using corrected moments
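The four steps above can be sketched in plain NumPy. This is an illustrative single-parameter sketch of the algorithm, not TensorFlow's internal implementation; the function name `adam_step` is chosen here for clarity.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g           # update first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2      # update second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-correct the first moment
    v_hat = v / (1 - beta2 ** t)              # bias-correct the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return w, m, v

# A single step from w = 1.0 with gradient 0.5: the very first update has
# magnitude close to lr, because m_hat / sqrt(v_hat) normalizes the gradient scale
w, m, v = adam_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

Note how the effective step size is roughly the learning rate regardless of the raw gradient's magnitude; this per-parameter normalization is what makes Adam "adaptive".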
Mathematical Formulation
The parameter update equation is −
w(t+1) = w(t) − η * m_t / (sqrt(v_t) + ε)
Where:
w(t): Parameter at iteration t
η: Learning rate
m_t: Bias-corrected first moment estimate
v_t: Bias-corrected second moment estimate
ε: Small constant (typically 1e-8) to prevent division by zero
First moment calculation −
m_t = β1 * m_(t-1) + (1 − β1) * g_t
Second moment calculation −
v_t = β2 * v_(t-1) + (1 − β2) * g_t^2
Where β1 (typically 0.9) and β2 (typically 0.999) are decay rates for the moment estimates.
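Because both moments start at zero, the raw moving averages underestimate the true moments during the first steps, and the bias correction exactly compensates for this. A small numeric sketch with a constant gradient makes this concrete:

```python
beta1 = 0.9
g = 2.0                             # suppose the gradient is constantly 2.0
m = beta1 * 0.0 + (1 - beta1) * g   # m_1 = 0.2, biased far below the true mean
m_hat = m / (1 - beta1 ** 1)        # corrected: 0.2 / 0.1 = 2.0, the true mean
```

Without the correction, the optimizer would take overly timid steps early in training; with it, even the very first estimate matches the gradient's actual scale.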
Example: Using Adam with MNIST Dataset
Here's a practical example demonstrating Adam optimizer training a neural network −
import tensorflow as tf
from tensorflow.keras.datasets import mnist

# Load and preprocess MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define neural network model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

# Compile model with Adam optimizer
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=3,
                    validation_data=(x_test, y_test), verbose=1)
print(f"Final validation accuracy: {history.history['val_accuracy'][-1]:.4f}")
Epoch 1/3
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2933 - accuracy: 0.9156 - val_loss: 0.1332 - val_accuracy: 0.9612
Epoch 2/3
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1422 - accuracy: 0.9571 - val_loss: 0.0985 - val_accuracy: 0.9693
Epoch 3/3
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1071 - accuracy: 0.9672 - val_loss: 0.0850 - val_accuracy: 0.9725
Final validation accuracy: 0.9725
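Passing the string 'adam' to compile() uses Keras's default hyperparameters. To tune the learning rate, decay rates, or ε discussed above, instantiate the optimizer explicitly; the values shown below are the Keras defaults (note that Keras uses ε = 1e-7 rather than 1e-8):

```python
import tensorflow as tf

# Explicit Adam instance; these arguments mirror the symbols in the update equation
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # η, the base step size
    beta_1=0.9,           # β1, decay rate for the first moment
    beta_2=0.999,         # β2, decay rate for the second moment
    epsilon=1e-7          # ε, numerical-stability constant (Keras default)
)
# Then: model.compile(optimizer=optimizer, loss=..., metrics=...) as before
```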
Advantages and Disadvantages
| Advantages | Disadvantages |
|---|---|
| Adaptive learning rates per parameter | Prone to overfitting on small datasets |
| Fast convergence | Sensitive to learning rate hyperparameter |
| Memory efficient | May not converge to global minimum |
| Works well with sparse gradients | Requires tuning of β1 and β2 parameters |
Common Applications
Adam optimizer is widely used across various domains −
Computer Vision: Image classification, object detection (YOLO), image segmentation
Natural Language Processing: Language models (GPT), sentiment analysis, machine translation
Speech Recognition: Automatic speech recognition systems, voice assistants
Reinforcement Learning: Game playing agents, robotic control
Medical Imaging: Disease diagnosis, medical image analysis
Conclusion
Adam optimizer combines momentum and adaptive learning rates to provide efficient training for deep neural networks. Its ability to automatically adjust learning rates makes it an excellent default choice for most deep learning applications, though careful hyperparameter tuning may be needed for optimal results.
