Scikit Learn - Decision Trees

In this chapter, we will learn about a learning method in Sklearn known as decision trees.

Decision trees (DTs) are a powerful non-parametric supervised learning method. They can be used for both classification and regression tasks. The main goal of DTs is to create a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. A decision tree has two main kinds of nodes: decision nodes, beginning with the root node, where the data is split, and leaf nodes, where we get the final output.
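To make the idea of "simple decision rules" concrete, the following minimal sketch fits a shallow tree and prints the rules it learned. The Iris data set, the max_depth=2 limit and the variable names are assumptions chosen for illustration; they are not part of this chapter's own example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data set (an assumption, not this chapter's data)
iris = load_iris()

# A shallow tree keeps the printed rules short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else style rules;
# the lines ending in "class: ..." are the leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line that holds a threshold comparison is a decision node, and each "class" line is a leaf that gives the final output.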

Decision Tree Algorithms

Different Decision Tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find those categorical features, for every node, that will yield the largest information gain for categorical targets.

It grows the tree to its maximum size and then applies a pruning step to improve the tree’s ability to generalise to unseen data. The output of this algorithm is a multiway tree.
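The information gain that ID3 maximises can be sketched in a few lines of plain Python. The tiny weather-style data set and the helper names entropy and information_gain below are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after a multiway split."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two categorical features and a categorical target
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
windy = ["no", "yes", "no", "no", "yes", "yes"]
play = ["no", "no", "yes", "yes", "no", "yes"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
print(gain_outlook, gain_windy)
```

On this toy data the outlook feature yields the larger gain, so an ID3-style algorithm would choose it for the split at this node.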

C4.5

It is the successor to ID3. It removes the restriction that features must be categorical by dynamically defining a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. It converts the trained tree (the output of the ID3 algorithm) into sets of ‘IF-THEN’ rules.

To determine the order in which these rules should be applied, the accuracy of each rule is evaluated first.

C5.0

It works similarly to C4.5, but it uses less memory and builds smaller rulesets. It is also more accurate than C4.5.

CART

It stands for the Classification and Regression Trees algorithm. It generates binary splits by using the feature and threshold that yield the largest information gain at each node, commonly measured with the Gini index (Gini impurity).

Homogeneity depends upon the Gini index: the lower the value of the Gini index, the higher the homogeneity of a node. It is similar to the C4.5 algorithm, but the difference is that it does not compute rule sets and it also supports numerical target variables (regression).
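The relationship between the Gini index and homogeneity can be checked with a short sketch. The helper names gini_impurity and best_threshold and the six-sample data set are hypothetical; the code only illustrates how a CART-style binary split on one numerical feature could be scored, not how scikit-learn implements it internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure node, larger for a more mixed node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoints between sorted unique values and return the
    (threshold, weighted Gini impurity) pair with the lowest impurity."""
    unique = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    scored = []
    for threshold in candidates:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        n = len(labels)
        score = len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
        scored.append((threshold, score))
    return min(scored, key=lambda pair: pair[1])

# Hypothetical toy data: one numerical feature and a binary target
heights = [150, 160, 165, 175, 180, 185]
labels = ["Woman", "Woman", "Woman", "Man", "Man", "Man"]

print(gini_impurity(labels))            # mixed node
print(gini_impurity(["Man", "Man"]))    # pure node
threshold, impurity = best_threshold(heights, labels)
print(threshold, impurity)
```

A perfectly homogeneous node has a Gini impurity of 0, which is why the split that produces the lowest weighted impurity is preferred.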

Classification with decision trees

In this case, the target variable is categorical.

Sklearn Module − The Scikit-learn library provides the class DecisionTreeClassifier for performing multiclass classification on a dataset.

Parameters

The following table lists the parameters used by the sklearn.tree.DecisionTreeClassifier class −

Sr.No	Parameter & Description
1	criterion − string, optional default= “gini” It represents the function to measure the quality of a split. Supported criteria are “gini” and “entropy”. The default is gini which is for Gini impurity while entropy is for the information gain.
2	splitter − string, optional default= “best” It tells the model which strategy, “best” or “random”, to use when choosing the split at each node.
3	max_depth − int or None, optional default=None This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain less than min_samples_split samples.
4	min_samples_split − int, float, optional default=2 This parameter provides the minimum number of samples required to split an internal node.
5	min_samples_leaf − int, float, optional default=1 This parameter provides the minimum number of samples required to be at a leaf node.
6	min_weight_fraction_leaf − float, optional default=0. This parameter provides the minimum weighted fraction of the sum of weights (of all the input samples) required to be at a leaf node.
7	max_features − int, float, string or None, optional default=None It gives the model the number of features to be considered when looking for the best split.
8	random_state − int, RandomState instance or None, optional, default = None This parameter represents the seed of the pseudo random number generator used by the estimator. The options are as follows − int − In this case, random_state is the seed used by the random number generator. RandomState instance − In this case, random_state is the random number generator. None − In this case, the random number generator is the RandomState instance used by np.random.
9	max_leaf_nodes − int or None, optional default=None This parameter grows a tree with at most max_leaf_nodes leaves in best-first fashion. The default is None, which means there is no limit on the number of leaf nodes.
10	min_impurity_decrease − float, optional default=0. This value works as a criterion for a node to split because the model will split a node if this split induces a decrease of the impurity greater than or equal to min_impurity_decrease value.
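11	min_impurity_split − float, default=1e-7 It represents the threshold for early stopping in tree growth: a node is split only if its impurity is above this threshold, otherwise it becomes a leaf. This parameter was deprecated in favour of min_impurity_decrease and is no longer accepted by recent scikit-learn releases.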
12	class_weight − dict, list of dicts, “balanced” or None, default=None It represents the weights associated with classes. The form is {class_label: weight}. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight: balanced, it will use the values of y to automatically adjust weights.
13	presort − bool, optional default=False It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False; on large datasets, setting it to True may slow down the training process. This parameter is no longer accepted by recent scikit-learn releases.
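As a minimal sketch of how a few of these parameters are passed to the constructor, the example below limits the size of the tree. The Iris data set and the specific parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data set (an assumption, not this chapter's data)
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="entropy",   # measure split quality by information gain
    max_depth=3,           # never grow deeper than three levels
    min_samples_leaf=5,    # every leaf must hold at least five samples
    random_state=0,        # fixed seed for reproducible results
)
clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```

Limiting max_depth and raising min_samples_leaf are common ways to keep a tree from memorising the training data.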

Attributes

The following table lists the attributes of the sklearn.tree.DecisionTreeClassifier class −

Sr.No	Attribute & Description
1	feature_importances_ − array of shape =[n_features] This attribute will return the feature importance.
2	classes_ − array of shape = [n_classes] or a list of such arrays It represents the class labels for a single-output problem, or a list of arrays of class labels for a multi-output problem.
3	max_features_ − int It represents the deduced value of max_features parameter.
4	n_classes_ − int or list It represents the number of classes for a single-output problem, or a list containing the number of classes for each output for a multi-output problem.
5	n_features_ − int It gives the number of features seen when the fit() method is performed. Recent scikit-learn releases expose this value as n_features_in_ instead.
6	n_outputs_ − int It gives the number of outputs when fit() method is performed.
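The fitted attributes can be inspected as in the following minimal sketch. The Iris data set is an assumption chosen for illustration, and n_features_ is left out because its availability depends on the installed scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data set (an assumption, not this chapter's data)
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Attributes ending with an underscore only exist after fit() has run
print(clf.feature_importances_)  # one importance value per feature
print(clf.classes_)              # the distinct class labels seen in y
print(clf.n_classes_)            # number of classes
print(clf.max_features_)         # deduced value of max_features
print(clf.n_outputs_)            # number of outputs
```

The feature importances are normalised, so they add up to 1.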

Methods

The following table lists the methods of the sklearn.tree.DecisionTreeClassifier class −

Sr.No	Method & Description
1	apply(self, X[, check_input]) This method will return the index of the leaf that each sample ends up in.
2	decision_path(self, X[, check_input]) As the name suggests, this method will return the decision path in the tree.
3	fit(self, X, y[, sample_weight, …]) fit() method will build a decision tree classifier from given training set (X, y).
4	get_depth(self) As the name suggests, this method will return the depth of the decision tree.
5	get_n_leaves(self) As name suggests, this method will return the number of leaves of the decision tree.
6	get_params(self[, deep]) We can use this method to get the parameters for estimator.
7	predict(self, X[, check_input]) It will predict class value for X.
8	predict_log_proba(self, X) It will predict class log-probabilities of the input samples provided by us, X.
9	predict_proba(self, X[, check_input]) It will predict class probabilities of the input samples provided by us, X.
10	score(self, X, y[, sample_weight]) As the name implies, the score() method will return the mean accuracy on the given test data and labels.
11	set_params(self, **params) We can set the parameters of estimator with this method.
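The following minimal sketch calls several of these methods on a fitted classifier. The Iris data set, the 70/30 split and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data set and split (assumptions, not this chapter's data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

leaf_ids = clf.apply(X_test)                # index of the leaf reached by each sample
node_indicator = clf.decision_path(X_test)  # sparse matrix, shape (n_samples, n_nodes)
accuracy = clf.score(X_test, y_test)        # mean accuracy on the held-out data

print(clf.get_depth(), clf.get_n_leaves())
print(leaf_ids[:5])
print(node_indicator.shape)
print(accuracy)

# set_params only changes the configuration; the tree must be
# fitted again before the new value has any effect
clf.set_params(max_depth=2)
print(clf.get_params()["max_depth"])
```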

Implementation Example

The Python script below uses the sklearn.tree.DecisionTreeClassifier class to construct a classifier for predicting male or female from our data set of 25 samples and two features, namely ‘height’ and ‘length of hair’ −

from sklearn import tree
from sklearn.model_selection import train_test_split

# Each sample is [height, length of hair]
X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28], [176, 15],
     [131, 32], [166, 6], [128, 32], [179, 10], [136, 34], [186, 2],
     [126, 25], [176, 28], [112, 38], [169, 9], [171, 36], [116, 25],
     [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32],
     [136, 35]]
Y = ['Man', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Man',
     'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man',
     'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Man', 'Woman',
     'Woman']
data_feature_names = ['height', 'length of hair']

# Hold-out split (not used below: the classifier is fitted on all 25 samples)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Build the classifier and fit it to the data
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)

# Predict the class of a new sample
prediction = DTclf.predict([[135, 29]])
print(prediction)

Output

['Woman']
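The script above creates a train/test split but fits the classifier on the full data set. As a hedged sketch of how the split could be used instead, the example below fits on the training part only and measures accuracy on the held-out part. The use of accuracy_score and the fixed random_state of the classifier are assumptions added for illustration; no particular accuracy value is claimed.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same 25 samples as above: [height, length of hair]
X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28], [176, 15],
     [131, 32], [166, 6], [128, 32], [179, 10], [136, 34], [186, 2],
     [126, 25], [176, 28], [112, 38], [169, 9], [171, 36], [116, 25],
     [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32],
     [136, 35]]
Y = ['Man', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Man',
     'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man',
     'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Man', 'Woman',
     'Woman']

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1
)

# Fit on the training samples only
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Evaluate on samples the tree has not seen during fitting
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(len(X_train), len(X_test))
print(accuracy)
```

With so few samples the accuracy estimate is very noisy, so it should be read as an illustration of the workflow rather than as a measure of model quality.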

We can also predict the probability of each class by using the predict_proba() method as follows −

Example

prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output

[[0. 1.]]
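The columns returned by predict_proba() follow the order of the classes_ attribute, so the two can be paired up. The sketch below is an illustration under assumptions: it uses only the first eight samples of the data set, and max_depth=1 is chosen so that a leaf may contain more than one class.

```python
from sklearn.tree import DecisionTreeClassifier

# First eight samples of the data set above: [height, length of hair]
X = [[165, 19], [175, 32], [136, 35], [174, 65],
     [141, 28], [176, 15], [131, 32], [166, 6]]
y = ['Man', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Man']

# A depth-one tree (an assumption for illustration) may have impure leaves
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

proba = clf.predict_proba([[135, 29]])

# Pair every class label with its predicted probability
for label, p in zip(clf.classes_, proba[0]):
    print(label, p)
```

The probability reported for a class is the fraction of training samples of that class in the leaf that the input sample reaches.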

Regression with decision trees

In this case, the target variable is continuous.

Sklearn Module − The Scikit-learn library provides the class DecisionTreeRegressor for applying decision trees to regression problems.

Parameters

The parameters used by DecisionTreeRegressor are almost the same as those used by DecisionTreeClassifier. The difference lies in the ‘criterion’ parameter. For DecisionTreeRegressor, the ‘criterion: string, optional default= “mse”’ parameter accepts the following values (recent scikit-learn releases name “mse” as “squared_error” and “mae” as “absolute_error”) −

mse − It stands for the mean squared error. It is equal to variance reduction as the feature selection criterion. It minimises the L2 loss using the mean of each terminal node.
friedman_mse − It also uses mean squared error but with Friedman’s improvement score.
mae − It stands for the mean absolute error. It minimises the L1 loss using the median of each terminal node.
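The statement that the mean minimises the L2 loss and the median minimises the L1 loss can be checked numerically. The target values in y_leaf and the helper names below are hypothetical; they stand for the samples that fall into one terminal node.

```python
from statistics import mean, median

# Hypothetical target values of the samples in one terminal node
y_leaf = [1.0, 2.0, 2.5, 3.0, 10.0]

def mean_squared_error(values, prediction):
    """Average squared difference between the values and one prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)

def mean_absolute_error(values, prediction):
    """Average absolute difference between the values and one prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)

leaf_mean = mean(y_leaf)
leaf_median = median(y_leaf)

# Squared error is lower when the node predicts its mean
print(mean_squared_error(y_leaf, leaf_mean), mean_squared_error(y_leaf, leaf_median))

# Absolute error is lower when the node predicts its median
print(mean_absolute_error(y_leaf, leaf_mean), mean_absolute_error(y_leaf, leaf_median))
```

Because of the outlier 10.0, the mean and the median differ noticeably here, which also shows why the absolute error criterion is less sensitive to outliers.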

Another difference is that DecisionTreeRegressor does not have the ‘class_weight’ parameter.

Attributes

The attributes of DecisionTreeRegressor are also the same as those of DecisionTreeClassifier. The difference is that it does not have the ‘classes_’ and ‘n_classes_’ attributes.

Methods

The methods of DecisionTreeRegressor are also the same as those of DecisionTreeClassifier. The difference is that it does not have the ‘predict_log_proba()’ and ‘predict_proba()’ methods.

Implementation Example

The fit() method of the decision tree regression model takes floating point values of y. Let’s see a simple implementation example using sklearn.tree.DecisionTreeRegressor −

from sklearn import tree

# Two training samples with two features each, and continuous targets
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]

# Build the regressor and fit it to the data
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)

Once fitted, we can use this regression model to make prediction as follows −

DTreg.predict([[4, 5]])

Output

array([1.5])
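With only two training samples, the example above cannot show how tree depth affects a regression fit. The following sketch uses a noisy sine curve for that purpose; the synthetic data, the two max_depth values and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a sine curve with noise on every fifth sample
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 0.5 * (0.5 - rng.rand(16))

# Two trees that differ only in their maximum depth
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

X_new = np.array([[1.0], [2.5], [4.0]])
print(shallow.predict(X_new))
print(deep.predict(X_new))

# score() returns the R^2 coefficient for regressors; here on the training data
print(shallow.score(X, y), deep.score(X, y))
```

The deeper tree follows the training data more closely, including its noise, so a higher training score does not by itself mean better predictions on new data.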