What is OOB error?


Introduction

OOB, or Out-of-Bag, error and the OOB score are terms related to Random Forests. Random Forest is an ensemble of decision trees that improves on the prediction of a single decision tree. OOB error measures the prediction error of tree-based ensembles like random forests and, more generally, of any ML model trained with the bagging method. The fraction of wrong classifications on the OOB samples is the OOB error.

In this article, let's explore the OOB error and OOB score. Before moving ahead, let us take a short overview of Random Forests and decision trees.

Random Forest Algorithm

Random Forest is an ensemble of decision trees. A decision tree model makes a prediction using a rule-based system that divides the data on feature values with simple decisions; each point where a decision is made becomes a node. Combining the predictions of many such decision trees forms a Random Forest model, a bootstrap-aggregated (bagged) model. Random forests are used for both regression and classification.

Random Forests improve on single decision trees in several ways −

  • They are less sensitive to outliers

  • They can model non-linear data

  • They are less prone to overfitting

  • They work effectively on large datasets

  • They usually achieve higher accuracy than a single decision tree, as the sketch below illustrates
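
The gap between a single tree and a forest is easy to see empirically. The snippet below is a minimal sketch, assuming scikit-learn is available; the dataset shape, split, and random_state values are illustrative choices, not part of the original article.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## synthetic binary classification data (illustrative parameters)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## fit a single tree and a forest of 100 trees on the same split
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

## the forest typically scores noticeably higher on held-out data
print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))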

OOB (Out-of-Bag) Score

The OOB score is a performance metric for Random Forests. Because each tree in the forest is trained on a bootstrap sample drawn with replacement, some samples are left out of that tree's training data; these are known as out-of-bag samples. Since the model never sees them during training, they act as a built-in test set, and evaluating the forest on them yields the OOB score.
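
Why do out-of-bag samples exist at all? A bootstrap sample of size n drawn with replacement leaves roughly 1/e, about 36.8%, of the rows unused. The following is a small NumPy-only sketch; the sample size and seed are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

## draw a bootstrap sample: n row indices chosen with replacement
boot = rng.integers(0, n, size=n)

## rows that were never drawn are out-of-bag for this tree
oob_mask = np.ones(n, dtype=bool)
oob_mask[boot] = False

print("OOB fraction:", oob_mask.mean())   ## prints a value close to 0.368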

Out-of-Bag (OOB) Error

The OOB error estimates the performance of a Random Forest on its OOB samples, and it is easy to obtain with the scikit-learn package. For each training sample, predictions are collected only from the trees whose bootstrap sample did not include it, and those predictions are aggregated (by majority vote for classification). The fraction of samples misclassified by this aggregate is the OOB error of the Random Forest model.
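
That aggregation can also be written out by hand. The following from-scratch sketch (assuming scikit-learn for the base trees; all names and parameters are illustrative) bags decision trees manually, records which rows each tree never saw, and computes the OOB error by majority vote over those trees only:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=0)
n, n_trees = len(X), 100
rng = np.random.default_rng(0)

## votes[i, c] counts the OOB trees that predict class c for row i
votes = np.zeros((n, 2), dtype=int)

for _ in range(n_trees):
    boot = rng.integers(0, n, size=n)           ## bootstrap indices, with replacement
    oob = np.setdiff1d(np.arange(n), boot)      ## rows this tree never saw
    tree = DecisionTreeClassifier().fit(X[boot], y[boot])
    votes[oob, tree.predict(X[oob])] += 1       ## record each OOB vote

## keep rows that were OOB for at least one tree, then take the majority vote
covered = votes.sum(axis=1) > 0
oob_pred = votes[covered].argmax(axis=1)
print("manual OOB error:", np.mean(oob_pred != y[covered]))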

Code Implementation using scikit-learn

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

## generate a synthetic binary classification dataset
data_x, data_y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=2)

## create the model; oob_score=True makes the forest track out-of-bag
## predictions and store the resulting OOB accuracy in oob_score_
model = RandomForestClassifier(n_estimators=200, oob_score=True)
model.fit(data_x, data_y)

## for a classifier, oob_score_ is the OOB accuracy, so the OOB error
## is its complement (the value varies run to run since no seed is set)
err_OOB = 1 - model.oob_score_

print("err_OOB: {}".format(err_OOB))

Output

err_OOB: 0.088
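
As a follow-up, scikit-learn also exposes each sample's averaged OOB class probabilities through the fitted model's oob_decision_function_ attribute, so the same error can be recomputed by hand (continuing the example above):

oob_probs = model.oob_decision_function_   ## shape: (n_samples, n_classes)
oob_pred = oob_probs.argmax(axis=1)        ## majority class among each sample's OOB trees
## note: a row can be NaN if a sample landed in every bootstrap (rare with 200 trees)
print("recomputed OOB error:", (oob_pred != data_y).mean())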

Advantages: OOB Score/Error

  • It gives an honest estimate of generalization, since the OOB samples on which the score is calculated were never used to train the trees that predict them

  • The estimate is not inflated by overfitting, because every prediction comes from trees that had no exposure to that sample during training

  • The model is effectively validated while it is being trained, so no separate hold-out set or extra testing pass is needed (see the sketch below)
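
A rough sketch of that last point, assuming scikit-learn (dataset and parameters are illustrative): a single fit with oob_score=True yields a validation-style score, while 5-fold cross-validation must refit the forest five times to get a comparable number.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)

## one training run, with OOB scoring as a by-product
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB score (one fit):", forest.oob_score_)

## five training runs for a comparable cross-validated estimate
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print("5-fold CV mean (five fits):", cv_scores.mean())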

Disadvantages: OOB Score/Error

  • The overall time for training the model may increase, since tracking the OOB samples and scoring them adds computation

  • It is best suited to small and medium datasets, since the OOB bookkeeping consumes more time on large ones

Conclusion

For Random Forests, OOB error is often a strong alternative to other validation metrics. Because it is computed on samples each tree never saw, it gives an honest estimate of generalization and helps keep overfitting in check, and it does so as a by-product of training, with no separate validation pass. The main cost is extra computation, which becomes noticeable on large datasets. The technique is also not limited to random forests: OOB error can be computed for any ML model trained with bagging.
