Gradient descent is used to minimize the loss in various machine learning algorithms. Mathematically speaking, it searches for a local minimum of a function by repeatedly stepping in the direction opposite to the gradient.
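As a rough illustration of the idea, the sketch below runs gradient descent on a simple one-variable function. The function f(x) = (x - 3)^2, the helper name `gradient_descent`, the learning rate and the number of steps are all assumptions chosen for this example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        # Move a small distance in the direction that decreases the function.
        x = x - learning_rate * grad(x)
    return x

# Illustrative function: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

With these settings the result ends up very close to 3. A learning rate that is too large can make the steps overshoot and diverge, so it usually needs tuning.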
To implement this, a set of parameters is defined, and the loss, viewed as a function of those parameters, is what needs to be minimized. Once the parameters are assigned initial values, the error or loss is calculated. Next, the parameters (weights) are updated so that the error decreases. Instead of parameters, weak learners such as decision trees can be used.
Once the loss is calculated, gradient descent is performed, and a tree is added to the model stepwise so that the loss is reduced at each step.
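The stepwise addition of trees can be sketched as follows. This is a minimal illustration and not a full boosting library: it assumes squared-error loss, uses one-split trees (stumps) as the weak learners, and the names `fit_stump` and `gradient_boost`, the toy data and the hyperparameters are all hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = x <= t
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual - np.where(left, lv, rv)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = y.mean()
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residual = y - pred
        t, lv, rv = fit_stump(x, residual)
        pred += learning_rate * np.where(x <= t, lv, rv)
        stumps.append((t, lv, rv))
    return base, stumps, pred

# Toy data, chosen only for illustration.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
base, stumps, pred = gradient_boost(x, y)
print(np.mean((y - pred) ** 2))  # training error after boosting
```

Each new stump is fitted to what the current model still gets wrong, so the training loss does not increase from one step to the next.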
Examples include fitting the coefficients in linear regression or finding the optimal weights in a machine learning algorithm.
There are different types of gradient descent algorithms, and some of them are discussed below.
Batch gradient descent is a type of gradient descent algorithm that processes the entire training dataset in every iteration of the algorithm’s run.
If the number of training examples is huge, batch gradient descent is computationally expensive. Hence, batch gradient descent is not preferred when the dataset is large.
In such cases, stochastic gradient descent or mini-batch gradient descent is preferred.
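A minimal sketch of batch gradient descent for linear regression is shown below. It assumes a mean squared error loss and NumPy. The function name `batch_gradient_descent`, the synthetic noiseless data and the hyperparameters are assumptions made for this example.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, epochs=1000):
    """Fit y ~ X @ w + b, using every training example in each update."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        error = X @ w + b - y                    # residuals on all m examples
        w -= learning_rate * (X.T @ error) / m   # gradient averaged over the full set
        b -= learning_rate * error.mean()
    return w, b

# Synthetic, noiseless data drawn from the line y = 2x + 1 (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0
w, b = batch_gradient_descent(X, y)
print(w, b)
```

Every update touches all 200 rows, which is why the cost per iteration grows with the size of the dataset.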
Stochastic gradient descent processes one training sample in every iteration. The parameters are updated after every iteration, since only one data sample is worked on at a time.
Each iteration is quicker than an iteration of batch gradient descent. However, the overhead is high if the number of training samples in the dataset is large.
This is because the number of iterations would be high, and the total time taken would also be high.
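The one-sample-per-update behaviour might look like the sketch below. As before, it assumes linear regression with squared error. The name `stochastic_gradient_descent`, the random visiting order, the synthetic noiseless data and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.05, epochs=50, seed=0):
    """Fit y ~ X @ w + b, updating the parameters after every single example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):       # visit the examples in random order
            error = X[i] @ w + b - y[i]    # residual on one example only
            w -= learning_rate * error * X[i]
            b -= learning_rate * error
    return w, b

# Synthetic, noiseless data drawn from the line y = 2x + 1 (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0
w, b = stochastic_gradient_descent(X, y)
print(w, b)
```

Note the inner Python loop: one pass over the data performs 200 separate updates here, which illustrates why the iteration count becomes the bottleneck on large datasets.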
Mini-batch gradient descent often works better in practice than batch gradient descent and stochastic gradient descent. Here, ‘b’ examples are processed in every iteration, where b < m.
The value ‘m’ refers to the total number of training examples in the dataset. If the number of training examples is high, the data is processed in batches, with each iteration handling one batch of ‘b’ training examples.
Mini-batch gradient descent works well with large training sets and needs fewer iterations than stochastic gradient descent.
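The batching described above can be sketched as follows, where `batch_size` plays the role of ‘b’ and the number of rows in `X` plays the role of ‘m’. The function name `mini_batch_gradient_descent`, the synthetic noiseless data and the hyperparameters are assumptions for this example.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, batch_size=32, learning_rate=0.1,
                                epochs=100, seed=0):
    """Fit y ~ X @ w + b, updating once per batch of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]    # the last batch may be smaller
            error = X[idx] @ w + b - y[idx]          # residuals on this batch only
            w -= learning_rate * (X[idx].T @ error) / len(idx)
            b -= learning_rate * error.mean()
    return w, b

# Synthetic, noiseless data drawn from the line y = 2x + 1 (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0
w, b = mini_batch_gradient_descent(X, y)
print(w, b)
```

With m = 200 and b = 32, each pass over the data takes 7 updates instead of the 200 that one-sample updates would need, while each update still uses vectorized arithmetic over the batch.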