How to Use Python for Ensemble Learning?
Ensemble learning is a machine learning technique that combines predictions from multiple models to improve overall performance. By training multiple models independently and combining their predictions, ensemble methods can reduce overfitting and improve generalization compared to single models.
This approach has proven successful in applications such as image classification, speech recognition, and natural language processing. In this tutorial, we'll explore four ensemble learning methods, each with a Python implementation: bagging, boosting, stacking, and voting.
Bagging
Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples of the training data and averages their predictions (or takes a majority vote). Random Forest is a popular bagging algorithm that applies this idea to decision trees.
Example
Here's how to implement bagging on the Iris dataset, using a Random Forest as the base estimator:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model and bagging ensemble
base_model = RandomForestClassifier(random_state=42)
bagging_ensemble = BaggingClassifier(estimator=base_model, n_estimators=10, random_state=42)

# Train and predict
bagging_ensemble.fit(X_train, y_train)
y_pred = bagging_ensemble.predict(X_test)

# Evaluate performance
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.94      0.97        16
   virginica       0.93      1.00      0.97        10

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
Boosting
Boosting trains models sequentially, where each model attempts to correct the errors of previous models. AdaBoost is a popular boosting algorithm that adjusts weights based on misclassified examples.
Example
Here's a boosting implementation using AdaBoost:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load dataset and split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model and boosting ensemble
base_model = DecisionTreeClassifier(max_depth=1, random_state=42)
boosting_ensemble = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train and predict
boosting_ensemble.fit(X_train, y_train)
y_pred = boosting_ensemble.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.94      1.00      0.97        16
   virginica       1.00      0.90      0.95        10

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
Stacking
Stacking trains multiple base models and uses their predictions as input to a meta-learner. This creates a two-level learning architecture where the meta-model learns how to best combine base model predictions.
Example
Here's a stacking implementation using scikit-learn's StackingClassifier:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create base models
base_models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
# Create meta-learner
meta_learner = LogisticRegression(random_state=42)
# Create stacking ensemble
stacking_ensemble = StackingClassifier(estimators=base_models, final_estimator=meta_learner, cv=5)
# Train and predict
stacking_ensemble.fit(X_train, y_train)
y_pred = stacking_ensemble.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        16
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
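The two-level architecture becomes tangible if you inspect what the meta-learner actually receives: `transform()` returns the stacked meta-features (the base models' predicted class probabilities), and after fitting, the trained meta-model is available as `final_estimator_`. A quick sketch, rebuilding the same ensemble on the full dataset for self-containment:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

base_models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(random_state=42),
    cv=5)
stack.fit(X, y)

# Meta-features: one column per base model per class probability
meta_features = stack.transform(X)
print("meta-feature matrix shape:", meta_features.shape)

# The fitted meta-learner has one coefficient row per class
print("meta-learner coef shape:", stack.final_estimator_.coef_.shape)
```

Looking at the coefficient magnitudes shows which base models the meta-learner leans on most.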
Voting
Voting combines predictions from multiple models using either hard voting (majority vote) or soft voting (average of predicted probabilities). It's the simplest ensemble method.
Example
Here's a voting ensemble implementation using soft voting:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create base models
models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
# Create voting ensemble (soft voting)
voting_ensemble = VotingClassifier(estimators=models, voting='soft')
# Train and predict
voting_ensemble.fit(X_train, y_train)
y_pred = voting_ensemble.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        16
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
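Hard and soft voting can behave differently: hard voting counts each model's predicted label once, while soft voting lets a confident model outweigh two uncertain ones. A sketch comparing the two with 5-fold cross-validation (cross-validated scores, so they need not match the single-split report above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

iris = load_iris()
models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]

results = {}
for voting in ('hard', 'soft'):
    ensemble = VotingClassifier(estimators=models, voting=voting)
    scores = cross_val_score(ensemble, iris.data, iris.target, cv=5)
    results[voting] = scores.mean()
    print(f"{voting} voting mean CV accuracy: {results[voting]:.3f}")
```

Note that hard voting does not require `probability=True` on the SVC; it is kept here so the same model list works for both modes.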
Comparison of Ensemble Methods
| Method | Training | Combination | Best For |
|---|---|---|---|
| Bagging | Parallel | Average/Vote | Reducing variance |
| Boosting | Sequential | Weighted combination | Reducing bias |
| Stacking | Two-level | Meta-learner | Complex patterns |
| Voting | Independent | Vote/Average | Simple combination |
Conclusion
Ensemble learning can significantly improve model performance by combining multiple models. Choose bagging to reduce variance, boosting to reduce bias, stacking to learn complex relationships between model predictions, and voting for a simple combination of strong models.
