# Scikit Learn - Boosting Methods

In this chapter, we will learn about the boosting methods in Sklearn, which enable building an ensemble model.

Boosting methods build an ensemble model incrementally. The main principle is to train the base estimators sequentially, one after another. To build a powerful ensemble, these methods combine several weak learners that are trained in sequence over multiple passes of the training data. The sklearn.ensemble module provides the following two boosting methods.

## AdaBoost

AdaBoost is one of the most successful boosting ensemble methods. Its key idea lies in the way it assigns weights to the instances in the dataset: instances that the earlier models misclassified receive larger weights, so subsequent models pay more attention to them and less attention to the instances that are already handled correctly.
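The reweighting idea can be sketched by hand. The following is a minimal illustration of discrete AdaBoost for a binary problem, not the implementation used inside scikit-learn; the variable names (`weights`, `alphas`, `stumps`) and the choice of 20 rounds are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem; labels are mapped to {-1, +1} for the weight update.
X, y01 = make_classification(n_samples=500, n_features=10, n_informative=2,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1

n_rounds = 20                               # illustrative choice
weights = np.full(len(y), 1.0 / len(y))     # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the currently weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to keep the logarithm finite.
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified instances get larger weights, correct ones smaller.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps.
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_accuracy = np.mean(np.sign(votes) == y)
first_stump_accuracy = np.mean(stumps[0].predict(X) == y)
print(first_stump_accuracy, train_accuracy)
```

Comparing the two printed accuracies gives a rough sense of how much the weighted vote adds over a single stump on the training data.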

### Classification with AdaBoost

For creating an AdaBoost classifier, the Scikit-learn module provides **sklearn.ensemble.AdaBoostClassifier**. While building this classifier, the main parameter is **base_estimator** (renamed **estimator** in recent scikit-learn releases), which is the **base estimator** from which the boosted ensemble is built. If we set this parameter to None, the base estimator is **DecisionTreeClassifier(max_depth=1)**.
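As a small sketch of this parameter, the snippet below compares the default base estimator with a custom one. The depth of 3 and the names `stump_clf` and `deep_clf` are illustrative assumptions. The estimator is passed as the first positional argument, since its keyword name differs between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)

# Default: base estimator left unset, so depth-1 trees (stumps) are boosted.
stump_clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Custom: boost depth-3 trees instead (an illustrative choice).
deep_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                              n_estimators=50, random_state=0).fit(X, y)

# Inspect the first fitted member of each ensemble.
print(type(stump_clf.estimators_[0]).__name__, stump_clf.estimators_[0].max_depth)
print(type(deep_clf.estimators_[0]).__name__, deep_clf.estimators_[0].max_depth)
```

The fitted members are available through the `estimators_` attribute, which is a convenient way to confirm which base estimator was actually used.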

### Implementation example

In the following example, we are building an AdaBoost classifier by using **sklearn.ensemble.AdaBoostClassifier** and also predicting and checking its score.

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.datasets import make_classification

    # Synthetic dataset: 1000 samples, 10 features, 2 of them informative.
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                               n_redundant=0, random_state=0, shuffle=False)

    # Boost 100 weak learners.
    ADBclf = AdaBoostClassifier(n_estimators=100, random_state=0)
    ADBclf.fit(X, y)

### Output

AdaBoostClassifier(algorithm = 'SAMME.R', base_estimator = None, learning_rate = 1.0, n_estimators = 100, random_state = 0)

### Example

Once fitted, we can predict for new values as follows −

print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

### Output

[1]

### Example

Now we can check the score as follows −

ADBclf.score(X, y)

### Output

0.995

### Example

We can also build an AdaBoost classifier on an external dataset. In the example below, we use the Pima Indians Diabetes dataset, loaded from a local CSV file.

    from pandas import read_csv
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.ensemble import AdaBoostClassifier

    # Load the data, supplying the column names explicitly.
    path = r"C:\pima-indians-diabetes.csv"
    headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:, 0:8]   # eight feature columns
    Y = array[:, 8]     # class label

    # random_state only has an effect when shuffle=True; recent
    # scikit-learn releases raise an error if it is set without shuffling.
    seed = 5
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

    # AdaBoostClassifier has no max_features parameter, so only the
    # number of boosting rounds is set.
    num_trees = 100
    ADBclf = AdaBoostClassifier(n_estimators=num_trees)
    results = cross_val_score(ADBclf, X, Y, cv=kfold)
    print(results.mean())

### Output

0.7851435406698566

### Regression with AdaBoost

For creating a regressor with the AdaBoost method, the Scikit-learn library provides **sklearn.ensemble.AdaBoostRegressor**. It accepts largely the same parameters as **sklearn.ensemble.AdaBoostClassifier**, plus a **loss** parameter that determines how the instance weights are updated after each boosting iteration.

### Implementation example

In the following example, we are building an AdaBoost regressor by using **sklearn.ensemble.AdaBoostRegressor** and also predicting for new values by using the predict() method.

    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.datasets import make_regression

    # Synthetic regression data: 10 features, 2 of them informative.
    X, y = make_regression(n_features=10, n_informative=2,
                           random_state=0, shuffle=False)

    # Boost 100 regressors.
    ADBregr = AdaBoostRegressor(random_state=0, n_estimators=100)
    ADBregr.fit(X, y)

### Output

AdaBoostRegressor(base_estimator = None, learning_rate = 1.0, loss = 'linear', n_estimators = 100, random_state = 0)

### Example

Once fitted, we can predict from the regression model as follows −

print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

### Output

[85.50955817]

## Gradient Tree Boosting

It is also called **Gradient Boosted Regression Trees** (GBRT). It is a generalization of boosting to arbitrary differentiable loss functions. It produces a prediction model in the form of an ensemble of weak prediction models, and can be used for both regression and classification problems. Its main advantage is that it naturally handles data of mixed types.
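To make the idea concrete, here is a minimal hand-written sketch of gradient boosting for the squared-error loss, where the negative gradient is simply the residual. It is an illustration under assumed settings (50 rounds, depth-2 trees, a shrinkage factor of 0.1), not the scikit-learn implementation.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

learning_rate = 0.1   # shrinkage applied to every tree (illustrative value)
n_rounds = 50         # number of weak learners (illustrative value)

# Stage 0: a constant model; the mean minimizes the squared error.
current = np.full(len(y), y.mean())
trees = []
mse_per_round = [np.mean((y - current) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual.
    residual = y - current
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)

    # Move the ensemble prediction a small step toward the residual fit.
    current += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_per_round.append(np.mean((y - current) ** 2))

print(mse_per_round[0], mse_per_round[-1])
```

Swapping in a different differentiable loss would change only how `residual` is computed, which is the sense in which the method generalizes boosting.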

### Classification with Gradient Tree Boost

For creating a Gradient Tree Boost classifier, the Scikit-learn module provides **sklearn.ensemble.GradientBoostingClassifier**. While building this classifier, the main parameter is ‘loss’, the loss function to be optimized. If we choose loss = ‘deviance’ (named ‘log_loss’ in recent scikit-learn releases), the model optimizes the deviance, which gives a classifier with probabilistic outputs.

On the other hand, if we set this parameter to ‘exponential’, gradient boosting recovers the AdaBoost algorithm. The parameter **n_estimators** controls the number of weak learners. The hyper-parameter **learning_rate** (in the range (0.0, 1.0]) controls overfitting via shrinkage.
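The interplay between the number of weak learners and the learning rate can be explored with the `staged_predict` method, which yields the ensemble's predictions after each boosting stage. The sketch below is one possible way to do this; the dataset size, the two learning rates and the `staged` dictionary are assumptions chosen for illustration.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_hastie_10_2(n_samples=4000, random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

staged = {}   # learning rate -> test accuracy after each stage
for rate in (1.0, 0.1):
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=rate,
                                     max_depth=1, random_state=0)
    clf.fit(X_train, y_train)

    # staged_predict yields predictions after 1, 2, ..., n_estimators stages.
    staged[rate] = [accuracy_score(y_test, pred)
                    for pred in clf.staged_predict(X_test)]

    # Test accuracy after 10, 50 and 100 weak learners.
    print(rate, staged[rate][9], staged[rate][49], staged[rate][-1])
```

A smaller learning rate usually needs more weak learners to reach a comparable accuracy, so the two parameters are best tuned together.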

### Implementation example

In the following example, we are building a Gradient Boosting classifier by using **sklearn.ensemble.GradientBoostingClassifier**. We are fitting this classifier with 50 weak learners.

    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import GradientBoostingClassifier

    # Generate the data and hold out everything after the first 5000 samples.
    X, y = make_hastie_10_2(random_state=0)
    X_train, X_test = X[:5000], X[5000:]
    y_train, y_test = y[:5000], y[5000:]

    # 50 depth-1 trees, no shrinkage (learning_rate = 1.0).
    GDBclf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
                                        max_depth=1, random_state=0).fit(X_train, y_train)
    GDBclf.score(X_test, y_test)

### Output

0.8724285714285714

### Example

We can also build a Gradient Boosting classifier on an external dataset. In the following example, we use the Pima Indians Diabetes dataset, loaded from a local CSV file.

    from pandas import read_csv
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.ensemble import GradientBoostingClassifier

    # Load the data, supplying the column names explicitly.
    path = r"C:\pima-indians-diabetes.csv"
    headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:, 0:8]   # eight feature columns
    Y = array[:, 8]     # class label

    # random_state only has an effect when shuffle=True; recent
    # scikit-learn releases raise an error if it is set without shuffling.
    seed = 5
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

    # 100 trees, each split considering at most 5 features.
    num_trees = 100
    max_features = 5
    GDBclf = GradientBoostingClassifier(n_estimators=num_trees, max_features=max_features)
    results = cross_val_score(GDBclf, X, Y, cv=kfold)
    print(results.mean())

### Output

0.7946582356674234

### Regression with Gradient Tree Boost

For creating a regressor with the Gradient Tree Boost method, the Scikit-learn library provides **sklearn.ensemble.GradientBoostingRegressor**. The loss function for regression is specified via the **loss** parameter. The default is least squares, named ‘ls’ in older releases and ‘squared_error’ in recent ones.
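As a sketch of how the loss parameter might be used, the snippet below fits one regressor with the default loss and one with the 'huber' loss on training targets that contain a few artificially inflated values. The corruption step and the `results` dictionary are hypothetical, introduced only for this illustration; whether the Huber loss actually helps on a given dataset should be checked rather than assumed.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test = X[:500], X[500:]
y_train, y_test = y[:500].copy(), y[500:]

# Hypothetical corruption: inflate 25 training targets to act as outliers.
rng = np.random.RandomState(0)
outliers = rng.choice(len(y_train), size=25, replace=False)
y_train[outliers] += 50.0

# Default loss (least squares); left unset because its name varies by release.
default_reg = GradientBoostingRegressor(n_estimators=100, max_depth=2,
                                        random_state=0).fit(X_train, y_train)

# Huber loss, which limits the influence of large residuals.
huber_reg = GradientBoostingRegressor(n_estimators=100, max_depth=2,
                                      random_state=0, loss="huber").fit(X_train, y_train)

results = {
    "default": mean_absolute_error(y_test, default_reg.predict(X_test)),
    "huber": mean_absolute_error(y_test, huber_reg.predict(X_test)),
}
print(results)
```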

### Implementation example

In the following example, we are building a Gradient Boosting regressor by using **sklearn.ensemble.GradientBoostingRegressor** and also finding the mean squared error by using the mean_squared_error() function.

    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    # Generate the data and split it in half for training and testing.
    X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
    X_train, X_test = X[:1000], X[1000:]
    y_train, y_test = y[:1000], y[1000:]

    # loss is left at its default (least squares), because the name of
    # that value differs between releases: 'ls' or 'squared_error'.
    GDBreg = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1,
                                       max_depth=1, random_state=0).fit(X_train, y_train)

Once fitted, we can find the mean squared error as follows −

mean_squared_error(y_test, GDBreg.predict(X_test))

### Output

5.391246106657164