Scikit Learn - Stochastic Gradient Descent
Here, we will learn about an optimization algorithm available in Sklearn, called Stochastic Gradient Descent (SGD).
Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of the parameters/coefficients of a function that minimize a cost function. In Scikit-learn, it is used for discriminative learning of linear classifiers under convex loss functions, such as linear SVM and logistic regression. It has been successfully applied to large-scale datasets because the coefficients are updated after each training instance, rather than once after a full pass over all the instances.
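To make the per-instance update concrete, the following is a minimal sketch of SGD for a least-squares cost, written in plain NumPy. It is not how Scikit-learn implements SGD internally; the function name sgd_least_squares, the learning rate, the number of epochs and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, n_epochs=50, seed=0):
    """Fit y ~ X @ w + b, updating the coefficients after every sample."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order on each epoch
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]   # residual for this one sample
            w -= lr * error * X[i]        # gradient step on the weights
            b -= lr * error               # gradient step on the intercept
    return w, b

# Hypothetical noise-free data generated from known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0

w, b = sgd_least_squares(X, y)
print(w, b)
```

Because the coefficients move after each sample, the model starts improving long before a full pass over the data is finished, which is what makes the method attractive for large datasets.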
SGD Classifier
The Stochastic Gradient Descent (SGD) classifier implements a plain SGD learning routine that supports various loss functions and penalties for classification. Scikit-learn provides the SGDClassifier class to implement SGD classification.
Parameters
The following table lists the parameters used by the SGDClassifier class −
Sr.No  Parameter & Description 

1 
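loss − str, default = ‘hinge’ It represents the loss function to be used. The default value is ‘hinge’, which gives a linear SVM. The other classification losses that can be used are −
log − Logistic regression, a probabilistic classifier (named ‘log_loss’ in recent Scikit-learn releases).
modified_huber − A smooth loss that brings tolerance to outliers as well as probability estimates.
squared_hinge − Similar to ‘hinge’, but quadratically penalized.
perceptron − The linear loss used by the perceptron algorithm.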
2 
penalty − str, ‘none’, ‘l2’, ‘l1’, ‘elasticnet’ It is the regularization term used in the model. By default, it is L2. We can use L1 or ‘elasticnet’ as well; both may bring sparsity (feature selection) to the model, which is not achievable with L2.
3 
alpha − float, default = 0.0001 Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001. 
4 
l1_ratio − float, default = 0.15 This is called the ElasticNet mixing parameter. Its range is 0 < = l1_ratio < = 1. If l1_ratio = 1, the penalty would be L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty. 
5 
fit_intercept − Boolean, default = True This parameter specifies whether a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept is used in the calculation and the data is assumed to be already centered.
6 
tol − float or None, optional, default = 1e-3 This parameter represents the stopping criterion for iterations. If it is not None, the iterations stop when loss > best_loss - tol for n_iter_no_change successive epochs.
7 
shuffle − Boolean, optional, default = True This parameter specifies whether the training data should be shuffled after each epoch.
8 
verbose − integer, default = 0 It represents the verbosity level. Its default value is 0. 
9 
epsilon − float, default = 0.1 This parameter specifies the width of the insensitive region. If loss = ‘epsilon_insensitive’, any difference between the current prediction and the correct label that is smaller than this threshold is ignored.
10 
max_iter − int, optional, default = 1000 As the name suggests, it represents the maximum number of passes over the training data, i.e. epochs.
11 
warm_start − bool, optional, default = false With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose default i.e. false, it will erase the previous solution. 
12 
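random_state − int, RandomState instance or None, optional, default = None This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.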
13 
n_jobs − int or None, optional, default = None It represents the number of CPUs to be used in the OVA (One Versus All) computation for multiclass problems. The default value is None, which means 1.
14 
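learning_rate − string, optional, default = ‘optimal’ It is the learning rate schedule. The options are as follows −
constant − eta = eta0
optimal − eta = 1.0 / (alpha * (t + t0)), where t0 is chosen by a heuristic.
invscaling − eta = eta0 / pow(t, power_t)
adaptive − eta = eta0, as long as the training loss keeps decreasing; otherwise the current learning rate is divided by 5.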
15 
eta0 − double, default = 0.0 It represents the initial learning rate for the learning rate options ‘constant’, ‘invscaling’ and ‘adaptive’.
16 
power_t − double, default = 0.5 It is the exponent for the ‘invscaling’ learning rate.
17 
early_stopping − bool, default = False This parameter controls whether early stopping is used to terminate training when the validation score is not improving. When set to True, it automatically sets aside a stratified fraction of the training data as a validation set and stops training when the validation score is not improving.
18 
validation_fraction − float, default = 0.1 It is only used when early_stopping is True. It represents the proportion of the training data to set aside as a validation set for early stopping.
19 
n_iter_no_change − int, default = 5 It represents the number of iterations with no improvement that the algorithm should run before stopping.
20 
class_weight − dict, {class_label: weight} or “balanced”, or None, optional This parameter represents the weights associated with the classes. If not provided, all classes are assumed to have weight 1.
21 
average − Boolean or int, optional, default = False When set to True, it computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging begins once the total number of samples seen reaches the value of average.
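As an illustration of how several of these parameters fit together, the sketch below trains an SGDClassifier with an elastic-net penalty and early stopping. The synthetic data from make_classification and the particular parameter values are assumptions chosen only for demonstration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical data: 500 samples, 20 features, 5 of them informative
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

clf = SGDClassifier(
    loss="hinge",             # linear SVM
    penalty="elasticnet",     # mix of L1 and L2 regularization
    alpha=1e-3,               # strength of the regularization term
    l1_ratio=0.5,             # equal weight on the L1 and L2 parts
    max_iter=1000,            # upper bound on the number of epochs
    tol=1e-3,                 # stopping tolerance
    early_stopping=True,      # hold out part of the training data
    validation_fraction=0.2,  # size of the held-out validation set
    n_iter_no_change=5,       # patience before stopping
    random_state=0,           # reproducible shuffling
)
clf.fit(X, y)

# Number of epochs actually run before the stopping criterion was met
print(clf.n_iter_)
```

With early_stopping enabled, training usually ends well before max_iter epochs, so n_iter_ is a quick way to see how long the model actually trained.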
Attributes
The following table lists the attributes of the SGDClassifier class −
Sr.No  Attributes & Description 

1 
coef_ − array, shape (1, n_features) if n_classes==2, else (n_classes, n_features) This attribute provides the weight assigned to the features. 
2 
intercept_ − array, shape (1,) if n_classes==2, else (n_classes,) It represents the independent term in decision function. 
3 
n_iter_ − int It gives the number of iterations to reach the stopping criterion. 
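The shapes listed above can be checked with a short sketch that fits one binary and one three-class model. The random data and labels are assumptions used purely to inspect the attribute shapes, so the fitted models themselves are not meaningful.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(60, 4)                 # 60 samples, 4 features
y_binary = np.repeat([0, 1], 30)     # two classes
y_multi = np.repeat([0, 1, 2], 20)   # three classes

binary_clf = SGDClassifier(random_state=0).fit(X, y_binary)
multi_clf = SGDClassifier(random_state=0).fit(X, y_multi)

# Binary problem: a single weight vector and a single intercept
print(binary_clf.coef_.shape, binary_clf.intercept_.shape)
# Multiclass problem: one weight vector and one intercept per class
print(multi_clf.coef_.shape, multi_clf.intercept_.shape)
```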
Implementation Example
Like other classifiers, the Stochastic Gradient Descent (SGD) classifier has to be fitted with the following two arrays −
An array X holding the training samples. It is of size [n_samples, n_features].
An array Y holding the target values i.e. class labels for the training samples. It is of size [n_samples].
Example
The following Python script uses the SGDClassifier linear model −
import numpy as np
from sklearn import linear_model

# Four training samples with two features each
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
# Class labels for the four samples
Y = np.array([1, 1, 2, 2])
SGDClf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet")
SGDClf.fit(X, Y)
Output
SGDClassifier( alpha = 0.0001, average = False, class_weight = None, early_stopping = False, epsilon = 0.1, eta0 = 0.0, fit_intercept = True, l1_ratio = 0.15, learning_rate = 'optimal', loss = 'hinge', max_iter = 1000, n_iter = None, n_iter_no_change = 5, n_jobs = None, penalty = 'elasticnet', power_t = 0.5, random_state = None, shuffle = True, tol = 0.001, validation_fraction = 0.1, verbose = 0, warm_start = False )
Example
Now, once fitted, the model can predict new values as follows −
SGDClf.predict([[2.,2.]])
Output
array([2])
Example
For the above example, we can get the weight vector with the help of following python script −
SGDClf.coef_
Output
array([[19.54811198, 9.77200712]])
Example
Similarly, we can get the value of intercept with the help of following python script −
SGDClf.intercept_
Output
array([10.])
Example
We can get the signed distance to the hyperplane by using SGDClassifier.decision_function as used in the following python script −
SGDClf.decision_function([[2., 2.]])
Output
array([68.6402382])
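For a linear model, the value returned by decision_function is simply the dot product of the sample with coef_ plus intercept_. The sketch below verifies this on a small two-class toy data set; the variable names are assumptions, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = SGDClassifier(
    max_iter=1000, tol=1e-3, penalty="elasticnet", random_state=0
).fit(X, Y)

X_new = np.array([[2.0, 2.0]])

# Signed distance computed by hand: w . x + b
manual_score = (X_new @ clf.coef_.T).ravel() + clf.intercept_
library_score = clf.decision_function(X_new)
print(manual_score, library_score)

# In the binary case, a positive score maps to the second entry of classes_
predicted = clf.classes_[(manual_score > 0).astype(int)]
print(predicted, clf.predict(X_new))
```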
SGD Regressor
The Stochastic Gradient Descent (SGD) regressor implements a plain SGD learning routine that supports various loss functions and penalties to fit linear regression models. Scikit-learn provides the SGDRegressor class to implement SGD regression.
Parameters
The parameters used by SGDRegressor are almost the same as those used by SGDClassifier. The difference lies in the ‘loss’ parameter. For SGDRegressor, the possible values of the loss parameter are as follows −
squared_loss − It refers to the ordinary least squares fit (named ‘squared_error’ in recent Scikit-learn releases).
huber − It corrects for outliers by switching from squared to linear loss past a distance of epsilon. In effect, ‘huber’ modifies ‘squared_loss’ so that the algorithm focuses less on correcting outliers.
epsilon_insensitive − It ignores errors smaller than epsilon and is linear past that.
squared_epsilon_insensitive − It is the same as epsilon_insensitive, except that it becomes squared loss past a tolerance of epsilon.
Another difference is that the parameter named ‘power_t’ has the default value of 0.25 rather than 0.5 as in SGDClassifier. Furthermore, it doesn’t have ‘class_weight’ and ‘n_jobs’ parameters.
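The regression losses can be tried side by side with a short loop, as in the following sketch. The synthetic data, the injected outliers and the choice of epsilon are assumptions for illustration. The least-squares loss is left out because its name differs between Scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(100)
y[:5] += 20.0   # hypothetical outliers added to a few targets

fitted = {}
for loss in ("huber", "epsilon_insensitive", "squared_epsilon_insensitive"):
    reg = SGDRegressor(
        loss=loss, epsilon=0.1, max_iter=1000, tol=1e-3, random_state=0
    )
    reg.fit(X, y)
    fitted[loss] = reg
    # Compare the fitted slope and intercept for each loss
    print(loss, reg.coef_, reg.intercept_)
```

Comparing the printed slopes with the true slope of 3.0 gives a rough feel for how much each loss is pulled by the outliers.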
Attributes
The attributes of SGDRegressor are also the same as those of SGDClassifier. In addition, it has the following three attributes −
average_coef_ − array, shape(n_features,)
As the name suggests, it provides the averaged weights assigned to the features.
average_intercept_ − array, shape(1,)
As the name suggests, it provides the averaged intercept term.
t_ − int
It provides the number of weight updates performed during the training phase.
Note − The attributes average_coef_ and average_intercept_ are available only after setting the parameter ‘average’ to True.
Implementation Example
The following Python script uses the SGDRegressor linear model −
import numpy as np
from sklearn import linear_model

n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
# Random targets and features, for illustration only
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
SGDReg = linear_model.SGDRegressor(
    max_iter=1000, penalty="elasticnet", loss="huber", tol=1e-3, average=True
)
SGDReg.fit(X, y)
Output
SGDRegressor( alpha = 0.0001, average = True, early_stopping = False, epsilon = 0.1, eta0 = 0.01, fit_intercept = True, l1_ratio = 0.15, learning_rate = 'invscaling', loss = 'huber', max_iter = 1000, n_iter = None, n_iter_no_change = 5, penalty = 'elasticnet', power_t = 0.25, random_state = None, shuffle = True, tol = 0.001, validation_fraction = 0.1, verbose = 0, warm_start = False )
Example
Now, once fitted, we can get the weight vector with the help of following python script −
SGDReg.coef_
Output
array([0.00423314, 0.00362922, 0.00380136, 0.00585455, 0.00396787])
Example
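Similarly, we can get the value of the intercept with the help of the following Python script −
SGDReg.intercept_
Output
A one-element array containing the fitted intercept; the exact value varies between runs because random_state is not set.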
Example
We can get the number of weight updates during training phase with the help of the following python script −
SGDReg.t_
Output
61.0
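The value of t_ is tied to the number of epochs actually run: it counts one update per sample per epoch, starting from 1. The sketch below checks that relationship on the same kind of random data. It assumes a plain fit without early stopping, and it fixes random_state, so the numbers will not necessarily match the output shown above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)

reg = SGDRegressor(
    max_iter=1000, penalty="elasticnet", loss="huber", tol=1e-3, random_state=0
).fit(X, y)

# One weight update per sample per epoch, with the counter starting at 1
expected_updates = reg.n_iter_ * X.shape[0] + 1
print(reg.n_iter_, reg.t_, expected_updates)
```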
Pros and Cons of SGD
The following are the pros of SGD −
Stochastic Gradient Descent (SGD) is very efficient.
It is easy to implement and offers lots of opportunities for code tuning.
The following are the cons of SGD −
Stochastic Gradient Descent (SGD) requires several hyperparameters, such as the regularization parameter and the number of iterations.
It is sensitive to feature scaling.
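A common way to deal with the sensitivity to feature scaling is to standardize the features before they reach the SGD estimator. The following sketch does this with a pipeline. The blob data and the deliberately rescaled second feature are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated clusters; the second feature is put on a much larger scale
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6]], cluster_std=1.0, random_state=0
)
X[:, 1] *= 1000.0

# StandardScaler brings every feature to zero mean and unit variance
# before the data is passed on to the SGD classifier
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)

accuracy = model.score(X, y)
print(accuracy)
```

Putting the scaler inside the pipeline means the same transformation learned from the training data is applied automatically whenever the model predicts on new samples.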