How can a decision tree be used to construct a classifier in Python?


A decision tree is the basic building block of the random forest algorithm. It is one of the most popular algorithms in machine learning and is widely used for classification. Decision trees are popular largely because they are easy to understand.

The sequence of decisions a tree follows can be used to explain why a certain prediction was made, so the whole process is transparent to the user. Decision trees are also a foundation for ensemble methods such as bagging, random forests, and gradient boosting. They are also known as CART, i.e. Classification And Regression Trees. A decision tree can be visualized as a binary tree (the kind studied in data structures and algorithms).
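To make the point about explainability concrete, here is a minimal sketch that prints the rules a fitted tree has learned. The toy data and class labels 'A' and 'B' are made up for illustration, and random_state=0 is an assumption added only to make the run repeatable.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, made up for illustration: two features per sample
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = ['A', 'A', 'B', 'B']

# random_state is an assumption added for repeatable runs
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else style rules
rules = export_text(clf, feature_names=['Feature_1', 'Feature_2'])
print(rules)
```

Each indented line in the printed text is one node, so the path from the top to a "class:" line is the explanation for any prediction that ends in that leaf.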

Every internal node in the tree tests a single input variable, and the leaf nodes (which are also known as terminal nodes) contain the output variable. The leaf node that a sample reaches is used to make the prediction for that sample. When a decision tree is being created, the basic idea is that the feature space is divided into multiple sections. Candidate splits over the feature values are tried, and the one that gives the lowest cost and the best prediction values is kept. Splits are chosen in a greedy manner, meaning each one is the best available at that step, without looking ahead.

Splitting of these nodes goes on until a stopping condition is met, such as reaching the maximum depth of the tree or producing nodes that contain a single class. The idea behind using a decision tree is to divide the input dataset into smaller subsets based on specific feature values until the samples in each subset fall under one single category. Each split is made so as to get the maximum information gain at that step.
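The effect of a depth limit can be seen in a small sketch. The data below is made up for illustration, and random_state=0 is an assumption added for repeatable runs; max_depth and get_depth() are part of scikit-learn's DecisionTreeClassifier.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration: one feature, classes alternate in pairs
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']

# No depth limit: splitting continues until every leaf holds a single class
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Depth limit of 1: only the root split is allowed
shallow_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

print("Depth without a limit:", full_tree.get_depth())
print("Depth with max_depth=1:", shallow_tree.get_depth())
print("Training accuracy without a limit:", full_tree.score(X, y))
print("Training accuracy with max_depth=1:", shallow_tree.score(X, y))
```

A single split cannot separate this alternating pattern, so the shallow tree misclassifies some training samples while the unrestricted tree fits them all.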

Every decision tree begins with a root, and this is the place where the first split is made. A measure is therefore needed to decide which split each node should use.

This is where the Gini value comes into the picture. Gini is one of the most commonly used measures of inequality. Here, inequality refers to how mixed the target classes (outputs) are among the samples in a node. A Gini value of 0 means every sample in the node belongs to the same class.
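As a worked example, the Gini value of a node can be computed as 1 minus the sum of the squared class proportions. The helper name gini_impurity below is hypothetical, written only for illustration, and is not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini value of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = ['Man', 'Man', 'Man', 'Man']        # a single class
mixed_node = ['Man', 'Woman', 'Man', 'Woman']   # evenly mixed

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

For two classes the value ranges from 0 (all samples in one class) to 0.5 (an even mix).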

Hence, the Gini value is calculated for every candidate split. Based on the Gini value (the inequality value), information gain can be defined as the drop in inequality from a parent node to its child nodes.
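The greedy choice of a split can be sketched for a single feature: try a threshold between each pair of neighbouring values and keep the one with the largest drop in Gini value. This is a simplified illustration, not scikit-learn's actual implementation, and the names gini and best_threshold are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini value of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the split with the largest Gini decrease."""
    parent = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds are midpoints between neighbouring distinct values
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each child's Gini value by its share of the samples
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data, made up for illustration
feature = [1, 2, 3, 10, 11, 12]
labels = ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman']

threshold, gain = best_threshold(feature, labels)
print(threshold, gain)  # 6.5 0.5
```

The threshold 6.5 separates the two classes completely, so the children have a Gini value of 0 and the gain equals the parent's Gini value of 0.5.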

DecisionTreeClassifier is the scikit-learn class used to perform decision tree classification, including multiclass classification.

Below is its syntax.

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini',…)

Following is an example:

Example

from sklearn import tree
from sklearn.model_selection import train_test_split

# Feature vectors: one [Feature_1, Feature_2] pair per sample
my_data = [[16,19],[17,32],[13,3],[14,5],[141,28],[13,34],[186,2],[126,25],[176,28],
[131,32],[166,6],[128,32],[79,110],[12,38],[19,91],[71,136],[116,25],[17,200], [15,25], [14,32],[13,35]]
# Target class of each sample, in the same order as my_data
target_vals =['Man','Woman','Man','Woman',
'Woman','Man','Woman','Woman',
'Woman','Woman','Woman','Man','Man',
'Man','Woman', 'Woman', 'Woman',
'Woman','Man','Woman','Woman']
data_feature_names = ['Feature_1','Feature_2']  # column names (not used below)

# Hold out 20% of the samples; note that this split is not used below,
# because the classifier is fit on the full dataset
X_train, X_test, y_train, y_test = train_test_split(my_data, target_vals, test_size = 0.2, random_state = 1)

clf = tree.DecisionTreeClassifier()
print("The decision tree classifier is being called")
# Fit the tree on all 21 samples
DTclf = clf.fit(my_data,target_vals)
# Predict the class of a new sample
prediction = DTclf.predict([[135,29]])
print("The predicted value is ")
print(prediction)

Output

The decision tree classifier is being called
The predicted value is
['Woman']

Explanation

  • The required packages are imported into the environment.
  • The code classifies samples into target classes based on their feature values.
  • The feature vectors and target values are defined.
  • The data is split into training and testing sets with the help of the ‘train_test_split’ function. In this example the model is fit on the full dataset, so the split itself is not used.
  • The DecisionTreeClassifier is created and fit to the data.
  • The ‘predict’ function is used to predict the class for a new feature vector.
  • The output is displayed on the console.
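The training and testing sets produced by ‘train_test_split’ can also be used to check how well the tree generalizes. The sketch below reuses the data from the example, fits on the training set only, and scores the predictions on the held-out set. Passing random_state=1 to the classifier is an assumption added for repeatable runs, and with only a handful of test samples the accuracy is a rough indication at best.

```python
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data as in the example above
my_data = [[16,19],[17,32],[13,3],[14,5],[141,28],[13,34],[186,2],[126,25],[176,28],
           [131,32],[166,6],[128,32],[79,110],[12,38],[19,91],[71,136],[116,25],
           [17,200],[15,25],[14,32],[13,35]]
target_vals = ['Man','Woman','Man','Woman','Woman','Man','Woman','Woman',
               'Woman','Woman','Woman','Man','Man','Man','Woman','Woman',
               'Woman','Woman','Man','Woman','Woman']

X_train, X_test, y_train, y_test = train_test_split(
    my_data, target_vals, test_size=0.2, random_state=1)

# Fit on the training set only (random_state here is an assumption)
clf = tree.DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Score the predictions on samples the tree has not seen
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Held-out samples:", len(X_test))
print("Accuracy on held-out samples:", accuracy)
```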

Updated on: 10-Dec-2020
