Difference Between SGD, GD, and Mini-batch GD

Machine learning relies heavily on optimization algorithms, which adjust a model's parameters to improve its performance on the training data. These methods search for the set of parameters that minimizes a cost function. The choice of optimization method can significantly affect the rate of convergence, the amount of noise in the updates, and how well the model generalizes, so picking a method that suits the problem matters for training the model successfully. Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent are the most prominent optimization strategies, and understanding how they differ helps you select the best one for your use case. In this post, we will go through the distinctions between SGD, GD, and mini-batch GD in depth.

What is Gradient Descent?

Gradient Descent (GD) is an optimization method frequently used in machine learning to find a good set of parameters for a model. GD updates the model parameters iteratively using the gradient of the cost function with respect to those parameters. The gradient points in the direction of the cost function's steepest increase, so by stepping in the opposite direction the method moves toward a minimum of the cost function. GD can be computationally expensive, because each iteration requires computing the gradient of the cost function over the whole training dataset. GD is frequently used as a benchmark for other optimization methods, as it can converge to the global minimum of the cost function under specific conditions, for example when the cost function is convex and the learning rate is suitably small.
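As a rough illustration of the full-batch update described above, here is a minimal sketch in plain Python. The model (a line y = w*x + b), the mean squared error cost, the function name batch_gradient_descent, the learning rate, the epoch count, and the toy data are all assumptions chosen for the example, not something prescribed by the method itself.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=1000):
    """Fit y = w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the MSE cost, averaged over the WHOLE training set.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step in the direction opposite to the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free toy data drawn from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = batch_gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Note that every single parameter update loops over all samples, which is exactly what makes GD expensive on large datasets.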

What is Stochastic Gradient Descent?

Stochastic Gradient Descent (SGD) is a widely used variant of gradient descent. At each iteration it selects a single training sample at random and updates the model parameters using the gradient of the cost function for that sample alone. Because SGD updates the parameters far more often than Gradient Descent, it often converges faster. However, relying on one random sample per step makes the updates noisy and causes the cost function to fluctuate considerably from one iteration to the next. Despite this noise, SGD is commonly preferred over Gradient Descent because it makes progress more quickly and needs less memory, since only one sample has to be processed per update.
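The single-sample update can be sketched as follows, again in plain Python. As before, the linear model, the squared-error cost, the function name stochastic_gradient_descent, the learning rate, the iteration count, the fixed random seed, and the toy data are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, lr=0.01, iterations=5000, seed=0):
    """Fit y = w*x + b, updating the parameters from ONE random sample per step."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    w, b = 0.0, 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))  # pick a single training sample at random
        error = w * xs[i] + b - ys[i]
        # Gradient of the squared error for this one sample only.
        w -= lr * 2 * error * xs[i]
        b -= lr * 2 * error
    return w, b


# Hypothetical noise-free toy data drawn from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = stochastic_gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Each step touches only one sample, so it is cheap, but successive steps can point in quite different directions. Many practical implementations shuffle the data once per epoch instead of sampling with replacement; that is a variation on the scheme shown here.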

What is Mini-batch Gradient Descent?

Mini-batch Gradient Descent is a variant of Gradient Descent that falls in between Stochastic Gradient Descent and full-batch Gradient Descent. The training dataset is split into mini-batches, which are smaller subsets of equal size, and the model parameters are updated using the average gradient of the cost function over one mini-batch at a time. Mini-batch Gradient Descent therefore updates the model parameters more often than Gradient Descent but less often than Stochastic Gradient Descent. In doing so it trades off the noise of stochastic updates against the computing cost of full-batch updates. It is the most frequently employed optimization method in deep learning and provides a fair balance between speed and accuracy.
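A minimal sketch of the mini-batch update, under the same kind of assumptions as any toy example: the linear model, the mean squared error cost, the function name minibatch_gradient_descent, the batch size, the learning rate, the epoch count, the per-epoch shuffling, and the data are all chosen purely for illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=2, lr=0.02, epochs=1000, seed=0):
    """Fit y = w*x + b, averaging the gradient over one mini-batch per update."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random split into mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of the squared error over this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free toy data drawn from y = 2x + 1 (six samples, batches of two).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]

w, b = minibatch_gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Setting batch_size to 1 in this sketch gives one update per sample, as in SGD, and setting it to the dataset size recovers full-batch GD, which is one way to see why mini-batch sits between the two.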

Difference Between Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent

| Gradient Descent | Stochastic Gradient Descent | Mini-batch Gradient Descent |
|---|---|---|
| Computes the gradient of the cost function over the whole training dataset and updates the model's parameters once per epoch, using the mean gradient over all training examples. | Computes the gradient of the cost function for a single randomly chosen training example at each iteration and updates the model parameters immediately. | Updates the model parameters using the mean gradient of the cost function over a mini-batch, a smaller subset of the training dataset; the mini-batches are of equal size. |
| Takes a long time to converge, because every update requires computing the gradient of the cost function over the whole training dataset. | Updates the model parameters far more often than GD, which usually lets it converge more quickly. | Updates the model parameters more often than GD but less often than SGD, striking a reasonable balance between speed and accuracy. |
| Consumes a lot of memory, because the whole training dataset must be held for each update. | Requires little memory, because only one training example is needed for each iteration. | Memory use is manageable, because only a fraction of the training examples is needed for each iteration. |
| Computationally expensive, because the gradient of the cost function must be computed over the whole training dataset at each iteration. | Each update is computationally cheap, because the gradient of the cost function is computed for only one training example per iteration. | Each update is computationally efficient, because the gradient of the cost function is computed for only a portion of the training examples per iteration. |
| Low noise: each update is based on the average over all training examples. | High noise: each update is based on just one training example. | Moderate noise: each update averages over a small number of training examples, so it is noisier than GD but less noisy than SGD. |
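To make the difference in update frequency concrete, the short sketch below counts how many parameter updates each method performs in one pass over the data. The helper name updates_per_epoch, the dataset size of 10,000, and the batch size of 32 are hypothetical values picked for the arithmetic.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one full pass over the training data."""
    return math.ceil(n_samples / batch_size)


n = 10_000  # hypothetical dataset size

gd_updates = updates_per_epoch(n, batch_size=n)          # whole dataset per update
sgd_updates = updates_per_epoch(n, batch_size=1)         # one sample per update
minibatch_updates = updates_per_epoch(n, batch_size=32)  # 32 samples per update

print(gd_updates, sgd_updates, minibatch_updates)  # prints: 1 10000 313
```

Under these assumptions GD makes a single expensive update per epoch, SGD makes 10,000 cheap and noisy ones, and mini-batch GD makes 313 updates of intermediate cost.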


In conclusion, gradient descent, stochastic gradient descent, and mini-batch gradient descent are the most popular optimization methods in machine learning. Gradient Descent converges slowly but has low noise, whereas Stochastic Gradient Descent converges quickly but has high noise. Mini-batch Gradient Descent, with a moderate level of noise, strikes a good balance between speed and accuracy. The size of the dataset, the amount of available memory, and the level of precision required all play a role in selecting the best method. Understanding the characteristics of each algorithm will help you, as a data scientist or machine learning practitioner, choose the best one for a given problem.

Updated on: 25-Apr-2023

