How to Use Python for Ensemble Learning?
Ensemble learning is a machine learning technique that combines predictions from multiple models to improve overall performance. By training multiple models independently and combining their predictions, ensemble methods can reduce overfitting and improve generalization compared to single models.
This approach has proven successful in applications such as image classification, speech recognition, and natural language processing. In this tutorial, we'll explore four ensemble learning methods, each with a Python implementation: bagging, boosting, stacking, and voting.
Bagging
Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples of the training data and averages their predictions (or takes a majority vote). Random Forest is a popular bagging algorithm that applies this idea to decision trees.
Example
Here's how to implement bagging on the Iris dataset, using a Random Forest as the base estimator:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model and bagging ensemble
base_model = RandomForestClassifier(random_state=42)
bagging_ensemble = BaggingClassifier(estimator=base_model, n_estimators=10, random_state=42)

# Train and predict
bagging_ensemble.fit(X_train, y_train)
y_pred = bagging_ensemble.predict(X_test)

# Evaluate performance
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.94      0.97        16
   virginica       0.93      1.00      0.97        10

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
Boosting
Boosting trains models sequentially, where each model attempts to correct the errors of previous models. AdaBoost is a popular boosting algorithm that adjusts weights based on misclassified examples.
Example
Here's a boosting implementation using AdaBoost:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load dataset and split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model and boosting ensemble
base_model = DecisionTreeClassifier(max_depth=1, random_state=42)
boosting_ensemble = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train and predict
boosting_ensemble.fit(X_train, y_train)
y_pred = boosting_ensemble.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.94      1.00      0.97        16
   virginica       1.00      0.90      0.95        10

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
Stacking
Stacking trains multiple base models and uses their predictions as input to a meta-learner. This creates a two-level learning architecture where the meta-model learns how to best combine base model predictions.
Example
Here's a stacking implementation using scikit-learn's StackingClassifier:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create base models
base_models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
# Create meta-learner
meta_learner = LogisticRegression(random_state=42)
# Create stacking ensemble
stacking_ensemble = StackingClassifier(estimators=base_models, final_estimator=meta_learner, cv=5)
# Train and predict
stacking_ensemble.fit(X_train, y_train)
y_pred = stacking_ensemble.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        16
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
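The two-level architecture becomes tangible if you inspect what the meta-learner actually receives: `transform()` returns the stacked meta-features (the base models' predicted class probabilities), and after fitting, the trained meta-model is available as `final_estimator_`. A quick sketch, rebuilding the same ensemble on the full dataset for self-containment:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

base_models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(random_state=42),
    cv=5)
stack.fit(X, y)

# Meta-features: one column per base model per class probability
meta_features = stack.transform(X)
print("meta-feature matrix shape:", meta_features.shape)

# The fitted meta-learner has one coefficient row per class
print("meta-learner coef shape:", stack.final_estimator_.coef_.shape)
```

Looking at the coefficient magnitudes shows which base models the meta-learner leans on most.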
Voting
Voting combines predictions from multiple models using either hard voting (majority vote) or soft voting (average of predicted probabilities). It's the simplest ensemble method.
Example
Here's a voting ensemble implementation using soft voting:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create base models
models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]
# Create voting ensemble (soft voting)
voting_ensemble = VotingClassifier(estimators=models, voting='soft')
# Train and predict
voting_ensemble.fit(X_train, y_train)
y_pred = voting_ensemble.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        16
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
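Hard and soft voting can behave differently: hard voting counts each model's predicted label once, while soft voting lets a confident model outweigh two uncertain ones. A sketch comparing the two with 5-fold cross-validation (cross-validated scores, so they need not match the single-split report above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

iris = load_iris()
models = [
    ('lr', LogisticRegression(random_state=42, max_iter=200)),
    ('svm', SVC(random_state=42, probability=True)),
    ('rf', RandomForestClassifier(random_state=42))
]

results = {}
for voting in ('hard', 'soft'):
    ensemble = VotingClassifier(estimators=models, voting=voting)
    scores = cross_val_score(ensemble, iris.data, iris.target, cv=5)
    results[voting] = scores.mean()
    print(f"{voting} voting mean CV accuracy: {results[voting]:.3f}")
```

Note that hard voting does not require `probability=True` on the SVC; it is kept here so the same model list works for both modes.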
Comparison of Ensemble Methods
| Method | Training | Combination | Best For |
|---|---|---|---|
| Bagging | Parallel | Average/Vote | Reducing variance |
| Boosting | Sequential | Weighted combination | Reducing bias |
| Stacking | Two-level | Meta-learner | Complex patterns |
| Voting | Independent | Vote/Average | Simple combination |
Conclusion
Ensemble learning can significantly improve model performance by combining multiple models. Choose bagging to reduce variance, boosting to reduce bias, stacking to learn complex relationships between model predictions, and voting for a simple combination of strong models.
