Scikit Learn - Randomized Decision Trees

This chapter explains the randomized decision tree algorithms available in Scikit-learn.

Randomized Decision Tree algorithms

A decision tree (DT) is usually trained by recursively splitting the data. Because a single tree is prone to overfitting, many trees are instead trained on different subsamples of the data and combined into an ensemble such as a random forest. The sklearn.ensemble module provides the following two algorithms based on randomized decision trees −

The Random Forest algorithm

A random forest is an ensemble of decision trees. Each tree is built from a sample drawn with replacement (a bootstrap sample) from the training set, and at each node the tree picks the locally optimal feature/split combination among the features under consideration. The forest then collects the prediction of every tree and combines them, by voting for classification or by averaging for regression. It can be used for both classification and regression tasks.
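The bootstrap-and-vote idea can be sketched by hand with plain decision trees. This is only an illustration under assumptions that are not part of the original text: the toy data from make_classification, the 25 trees, and the names rng, trees and votes are all arbitrary choices. Note also that scikit-learn's own RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes, so this is not how the library is implemented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data; the sizes and seeds are arbitrary choices for illustration
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per tree, then a majority vote per test sample
all_preds = np.array([tree.predict(X_test) for tree in trees]).astype(int)
votes = np.array([np.bincount(col, minlength=2).argmax() for col in all_preds.T])
accuracy = (votes == y_test).mean()
print("Hand-rolled forest accuracy:", accuracy)
```

In practice RandomForestClassifier does all of this internally; the sketch only makes the two sources of randomness (row sampling and feature sampling) visible.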

Classification with Random Forest

To create a random forest classifier, Scikit-learn provides sklearn.ensemble.RandomForestClassifier. The main parameters to adjust when building a random forest classifier are ‘max_features’ and ‘n_estimators’.

Here, ‘max_features’ is the size of the random subset of features to consider when splitting a node. If this parameter is set to None, all the features are considered rather than a random subset. On the other hand, ‘n_estimators’ is the number of trees in the forest. More trees generally give better results, but the improvement levels off beyond a certain number of trees, and the computation takes longer.
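To get a feel for these two parameters, one option is to cross-validate a few combinations on a small synthetic problem. The sketch below assumes toy data from make_classification and an arbitrary grid of values; the scores it prints depend on the data and are not a general recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic problem; sizes and seed are arbitrary
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

results = {}
for n_estimators in (1, 10, 50):
    for max_features in ("sqrt", None):  # None means "consider all features"
        clf = RandomForestClassifier(
            n_estimators=n_estimators, max_features=max_features, random_state=0
        )
        # Mean accuracy over 3 cross-validation folds
        results[(n_estimators, max_features)] = cross_val_score(clf, X, y, cv=3).mean()

for (n_estimators, max_features), score in results.items():
    print(f"n_estimators={n_estimators:3d}  max_features={max_features!s:5}  accuracy={score:.3f}")
```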

Implementation example

In the following example, we build a random forest classifier by using sklearn.ensemble.RandomForestClassifier and check its accuracy by using the cross_val_score function.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
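# Generate 10000 samples with 10 features, grouped around 100 centers
X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
RFclf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, random_state = 0)
# Mean accuracy over 5 cross-validation folds
scores = cross_val_score(RFclf, X, y, cv = 5)
print(scores.mean())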


Output

0.9997


Example

We can also build a random forest classifier on a real dataset. The following example uses the iris dataset, loaded from the UCI repository, and also reports the accuracy score and the confusion matrix.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

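path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
# Load the data; the file has no header row, so the column names are supplied
dataset = pd.read_csv(path, names = headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values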
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
RFclf = RandomForestClassifier(n_estimators = 50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)


Output

Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12

      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

Accuracy: 0.9777777777777777
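If downloading the CSV file is not convenient, the copy of iris that ships with scikit-learn can be used instead. The following is a sketch of that alternative, not the listing above: it assumes load_iris, and it fixes random_state and stratifies the split (both choices are mine) so that repeated runs give the same split. Its numbers will not match the output shown above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("Accuracy:", acc)
```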


Regression with Random Forest

To create a random forest regressor, Scikit-learn provides sklearn.ensemble.RandomForestRegressor. It takes largely the same parameters as sklearn.ensemble.RandomForestClassifier.

Implementation example

In the following example, we build a random forest regressor by using sklearn.ensemble.RandomForestRegressor and then predict for new values by using the predict() method.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2,random_state = 0, shuffle = False)
RFregr = RandomForestRegressor(max_depth = 10,random_state = 0,n_estimators = 100)
RFregr.fit(X, y)


Output

RandomForestRegressor(
bootstrap = True, criterion = 'mse', max_depth = 10,
max_features = 'auto', max_leaf_nodes = None,
min_impurity_decrease = 0.0, min_impurity_split = None,
min_samples_leaf = 1, min_samples_split = 2,
min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
oob_score = False, random_state = 0, verbose = 0, warm_start = False
)


Once fitted, we can predict with the regression model as follows −

print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))


Output

[98.47729198]
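For regression, the forest's prediction is the average of the predictions of its individual trees. A quick way to check this, assuming the same data and settings as the example above, is to query each fitted tree through the estimators_ attribute and compare the mean with predict(). The names new_point and per_tree are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=10, random_state=0, n_estimators=100).fit(X, y)

new_point = np.array([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]], dtype=float)

# Prediction of every individual tree for the new point
per_tree = np.array([tree.predict(new_point)[0] for tree in regr.estimators_])
forest_prediction = regr.predict(new_point)[0]

print("mean of tree predictions:", per_tree.mean())
print("forest prediction:       ", forest_prediction)
```

The spread of per_tree (for example per_tree.std()) also gives a rough idea of how much the individual trees disagree on this point.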


Extra-Tree Methods

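Extra-trees (extremely randomized trees) go one step further in randomization. For each feature under consideration, a random value is selected for the split instead of searching for the most discriminative threshold, and the best of these randomly generated splits is used. The benefit of extra-tree methods is that they usually reduce the variance of the model a bit more. The disadvantage is that they slightly increase the bias.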
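The difference between the two ensembles can be inspected directly on fitted models. The sketch below assumes toy data from make_classification and 50 trees per ensemble (arbitrary choices); it prints the default bootstrap setting and the splitter used by the underlying trees, then cross-validates both models. Which of the two scores higher depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
et = ExtraTreesClassifier(n_estimators=50, random_state=0)

# By default random forests bootstrap the rows, extra-trees use the whole training set
print("bootstrap:", rf.bootstrap, et.bootstrap)

# The underlying trees search for the best threshold (rf) or draw it at random (et)
print("splitter: ", rf.fit(X, y).estimators_[0].splitter, et.fit(X, y).estimators_[0].splitter)

rf_scores = cross_val_score(rf, X, y, cv=5)
et_scores = cross_val_score(et, X, y, cv=5)
print("random forest: mean=%.3f std=%.3f" % (rf_scores.mean(), rf_scores.std()))
print("extra-trees:   mean=%.3f std=%.3f" % (et_scores.mean(), et_scores.std()))
```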

Classification with Extra-Tree Method

To create a classifier using the Extra-tree method, Scikit-learn provides sklearn.ensemble.ExtraTreesClassifier. It takes the same parameters as sklearn.ensemble.RandomForestClassifier. The only difference is in the way, discussed above, the trees are built.

Implementation example

In the following example, we build an extra-trees classifier by using sklearn.ensemble.ExtraTreesClassifier and check its accuracy by using the cross_val_score function.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
X, y = make_blobs(n_samples = 10000, n_features = 10, centers=100,random_state = 0)
ETclf = ExtraTreesClassifier(n_estimators = 10,max_depth = None,min_samples_split = 10, random_state = 0)
scores = cross_val_score(ETclf, X, y, cv = 5)
scores.mean()


Output

1.0


Example

We can also build a classifier using the Extra-tree method on a real dataset. The following example uses the Pima Indians diabetes dataset, read from a local CSV file.

from pandas import read_csv

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
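path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Load the data; the file has no header row, so the column names are supplied
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
# shuffle = True is required when random_state is set in recent scikit-learn versions
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)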
num_trees = 150
max_features = 5
ETclf = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(ETclf, X, Y, cv=kfold)
print(results.mean())


Output

0.7551435406698566


Regression with Extra-Tree Method

To create an Extra-Tree regressor, Scikit-learn provides sklearn.ensemble.ExtraTreesRegressor. It takes largely the same parameters as sklearn.ensemble.ExtraTreesClassifier.

Implementation example

In the following example, we apply sklearn.ensemble.ExtraTreesRegressor to the same data we used while creating the random forest regressor. Let’s see the difference in the output.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2,random_state = 0, shuffle = False)
ETregr = ExtraTreesRegressor(max_depth = 10,random_state = 0,n_estimators = 100)
ETregr.fit(X, y)


Output

ExtraTreesRegressor(bootstrap = False, criterion = 'mse', max_depth = 10,
max_features = 'auto', max_leaf_nodes = None,
min_impurity_decrease = 0.0, min_impurity_split = None,
min_samples_leaf = 1, min_samples_split = 2,
min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
oob_score = False, random_state = 0, verbose = 0, warm_start = False)


Example

Once fitted, we can predict with the regression model as follows −

print(ETregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))


Output

[85.50955817]
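A single prediction says little about which regressor fits the data better. One way to compare the two, sketched here under the assumption that 5-fold cross-validation with the default R² score is an acceptable yardstick, is to run cross_val_score on both models with the same data and settings as above. The dictionary names models and r2 are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)

models = {
    "random forest": RandomForestRegressor(max_depth=10, random_state=0, n_estimators=100),
    "extra-trees": ExtraTreesRegressor(max_depth=10, random_state=0, n_estimators=100),
}

# For regressors, cross_val_score reports R^2 by default
r2 = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}

for name, score in r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```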