Scikit Learn - Logistic Regression



Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Given a set of independent variables, it estimates a discrete value (0 or 1, yes/no, true/false). It is also called the logit or MaxEnt classifier.

It models the relationship between a categorical dependent variable and one or more independent variables by estimating the probability that an event occurs, using the logistic function.
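To make the role of the logistic function concrete, here is a minimal sketch in plain Python. The weights, bias and sample values are hypothetical and chosen only for illustration; in practice they are learned from data. The helper names sigmoid and predict_proba are likewise assumptions of this sketch, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Estimated probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model parameters, for illustration only.
weights = [0.8, -0.4]
bias = -0.2

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0   # threshold the probability at 0.5
print(round(p, 3), label)      # prints: 0.731 1
```

The linear combination of the features gives z = 1.0 here, and the logistic function turns that score into a probability that is then thresholded to obtain the class.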

sklearn.linear_model.LogisticRegression is the class used to implement logistic regression.

Parameters

The following table lists the parameters used by the LogisticRegression class −

Sr.No Parameter & Description
1

penalty − str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional, default = ‘l2’

This parameter specifies the norm used in penalization (regularization): L1, L2, a mix of both (‘elasticnet’), or no penalty at all.

2

dual − Boolean, optional, default = False

It selects between the dual and the primal formulation. The dual formulation is only implemented for the L2 penalty with the liblinear solver.

3

tol − float, optional, default=1e-4

It represents the tolerance for the stopping criteria.

4

C − float, optional, default=1.0

It represents the inverse of regularization strength and must be a positive float. Smaller values specify stronger regularization.

5

fit_intercept − Boolean, optional, default = True

This parameter specifies whether a constant (bias or intercept) should be added to the decision function.

6

intercept_scaling − float, optional, default = 1

This parameter is useful only when both of the following hold −

  • the solver ‘liblinear’ is used

  • fit_intercept is set to True

7

class_weight − dict or ‘balanced’, optional, default = None

It represents the weights associated with classes. With the default option, all classes are given a weight of one. With class_weight = ‘balanced’, the values of y are used to adjust the weights automatically, in inverse proportion to the class frequencies.

8

random_state − int, RandomState instance or None, optional, default = None

This parameter represents the seed of the pseudo-random number generator used when shuffling the data. The options are as follows −

  • int − in this case, random_state is the seed used by the random number generator.

  • RandomState instance − in this case, random_state is the random number generator.

  • None − in this case, the random number generator is the RandomState instance used by np.random.

9

solver − str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, optional, default = ‘liblinear’ (changed to ‘lbfgs’ in scikit-learn 0.22)

This parameter selects the algorithm used in the optimization problem. The properties of each option are as follows −

  • liblinear − It is a good choice for small datasets. It handles both the L1 and the L2 penalty. For multiclass problems, it is limited to one-versus-rest schemes.

  • newton-cg − For multiclass problems, it handles multinomial loss. It handles only the L2 penalty.

  • lbfgs − For multiclass problems, it handles multinomial loss. It handles only the L2 penalty.

  • saga − It is a good choice for large datasets. For multiclass problems, it handles multinomial loss. Along with the L1 and L2 penalties, it also supports the ‘elasticnet’ penalty.

  • sag − It is also used for large datasets. For multiclass problems, it handles multinomial loss. It handles only the L2 penalty.

10

max_iter − int, optional, default = 100

As the name suggests, it represents the maximum number of iterations the solvers may take to converge.

11

multi_class − str, {‘ovr’, ‘multinomial’, ‘auto’}, optional, default = ‘ovr’ (changed to ‘auto’ in scikit-learn 0.22)

  • ovr − For this option, a binary problem is fit for each label.

  • multinomial − For this option, the loss minimized is the multinomial loss fit across the entire probability distribution. This option cannot be used if solver = ‘liblinear’.

  • auto − This option selects ‘ovr’ if solver = ‘liblinear’ or the data is binary; otherwise it selects ‘multinomial’.

12

verbose − int, optional, default = 0

The default value is 0, which keeps the solver silent. For the liblinear and lbfgs solvers, set verbose to any positive number to enable verbose output.

13

warm_start − bool, optional, default = False

With this parameter set to True, the solution of the previous call to fit is reused as initialization. With the default, False, the previous solution is erased. It has no effect for the liblinear solver.

14

n_jobs − int or None, optional, default = None

If multi_class = ‘ovr’, this parameter represents the number of CPU cores used when parallelizing over classes. It is ignored when solver = ‘liblinear’.

15

l1_ratio − float or None, optional, default = None

It is used only when penalty = ‘elasticnet’. It is the Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Setting l1_ratio = 0 is equivalent to the L2 penalty, and l1_ratio = 1 is equivalent to the L1 penalty.
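As an illustration of how C and penalty interact, the following is a minimal sketch under stated assumptions: the data is synthetic (generated with make_classification) and the specific values of C are arbitrary choices for demonstration. Parameter names follow the table above; some newer scikit-learn releases may emit deprecation warnings for a few of them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption for illustration):
# 20 features, of which only 3 carry information about the label.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0,
    random_state=0
)

# C is the inverse of regularization strength: a smaller C shrinks
# the coefficients more strongly.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print("norm of coef_ with C=0.01:", norm_strong)
print("norm of coef_ with C=100 :", norm_weak)

# The L1 penalty needs a solver that supports it (liblinear or saga).
# It tends to drive uninformative coefficients to exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(strong.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1, "with L2:", n_zero_l2)
```

Comparing the two coefficient norms shows the shrinkage caused by a small C, while the count of zero coefficients hints at why the L1 penalty is often used for feature selection.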

Attributes

The following table lists the attributes exposed by the LogisticRegression class −

Sr.No Attributes & Description
1

coef_ − array, shape (1, n_features) or (n_classes, n_features)

It holds the coefficients of the features in the decision function. When the given problem is binary, it is of the shape (1, n_features).

2

intercept_ − array, shape (1,) or (n_classes,)

It represents the constant, also known as bias, added to the decision function.

3

classes_ − array, shape(n_classes)

It will provide a list of class labels known to the classifier.

4

n_iter_ − array, shape (n_classes) or (1)

It returns the actual number of iterations for all the classes.
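The attributes listed above can be inspected after fitting. The sketch below assumes the iris dataset and leaves the solver at its default; max_iter = 1000 is an arbitrary choice made here so that the solver has room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Three-class problem: one row of coefficients per class.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)          # class labels seen during fit
print(clf.coef_.shape)       # (n_classes, n_features)
print(clf.intercept_.shape)  # (n_classes,)
print(clf.n_iter_)           # iterations used by the solver

# Binary problem (only classes 0 and 1 kept): a single row of coefficients.
mask = y < 2
binary_clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
print(binary_clf.coef_.shape)       # (1, n_features)
print(binary_clf.intercept_.shape)  # (1,)
```

The shape and content of n_iter_ depend on the solver and the multiclass scheme, so this sketch prints it without assuming a particular value.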

Implementation Example

The following Python script provides a simple example of implementing logistic regression on the iris dataset of scikit-learn −

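from sklearn import linear_model
from sklearn.datasets import load_iris
# Load the iris dataset as a feature matrix X and a label vector y
X, y = load_iris(return_X_y = True)
# Fit a liblinear-based classifier; 'auto' resolves to one-versus-rest here
LRG = linear_model.LogisticRegression(
   random_state = 0,solver = 'liblinear',multi_class = 'auto'
).fit(X, y)
# Mean accuracy on the same data the model was trained on
LRG.score(X, y)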

Output

0.96

The output shows that the above Logistic Regression model achieved an accuracy of 96 percent on the data it was trained on.
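Because the score above is computed on the training data, it may overstate how well the model generalizes. The following is a minimal sketch of a held-out evaluation, assuming a 70/30 stratified split and the default solver; the split ratio, random_state and max_iter values are arbitrary choices for illustration. It also shows predict_proba, which returns the estimated class probabilities described at the start of this chapter.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30 percent of the samples, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Mean accuracy on samples the model has not seen during fitting.
test_accuracy = model.score(X_test, y_test)
print("held-out accuracy:", test_accuracy)

# Estimated probability of each class for the first five test samples;
# each row sums to 1, and predict() returns the most probable class.
proba = model.predict_proba(X_test[:5])
predicted = model.predict(X_test[:5])
print(np.round(proba, 3))
print(predicted)
```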
