How can a decision tree be used to construct a classifier in Python?

Decision trees are among the most intuitive and widely used algorithms for classification tasks in machine learning. They work by recursively splitting the dataset on feature values, producing a tree-shaped model that makes a prediction by following a decision path from the root to a leaf node.

How Decision Trees Work

A decision tree partitions the input space into regions based on feature values. Each internal node tests a feature, while each leaf node holds a final prediction. The algorithm uses an impurity measure such as Gini impurity or entropy to choose the split that yields the greatest impurity reduction (information gain).

The tree grows recursively until a stopping criterion is met, such as reaching the maximum depth or falling below the minimum samples per leaf. This greedy approach picks the locally best split at each node, which is fast but does not guarantee a globally optimal tree.
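The impurity calculation behind this process is straightforward. Below is a minimal sketch in plain Python (not scikit-learn's internal implementation) of Gini impurity and the weighted impurity of a candidate split −

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ['Man', 'Man', 'Woman', 'Woman']
left, right = ['Man', 'Man'], ['Woman', 'Woman']
print(gini(parent))               # 0.5 (maximally mixed for two classes)
print(weighted_gini(left, right)) # 0.0 (both children are pure)
```

A split that drops the weighted impurity from 0.5 to 0.0 has the largest possible information gain, so the greedy algorithm would select it.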

DecisionTreeClassifier Syntax

Scikit-learn provides the DecisionTreeClassifier class for building decision tree models −

class sklearn.tree.DecisionTreeClassifier(
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=None
)

Example: Gender Classification

Let's build a decision tree classifier to predict gender based on two features −

from sklearn import tree
from sklearn.model_selection import train_test_split

# Sample data with two features
my_data = [[16, 19], [17, 32], [13, 3], [14, 5], [141, 28], [13, 34], [186, 2],
           [126, 25], [176, 28], [131, 32], [166, 6], [128, 32], [79, 110],
           [12, 38], [19, 91], [71, 136], [116, 25], [17, 200], [15, 25],
           [14, 32], [13, 35]]

target_vals = ['Man', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Woman', 'Woman',
               'Woman', 'Woman', 'Woman', 'Man', 'Man', 'Man', 'Woman', 'Woman',
               'Woman', 'Woman', 'Man', 'Woman', 'Woman']

data_feature_names = ['Feature_1','Feature_2']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(my_data, target_vals, test_size=0.2, random_state=1)

# Create and train the decision tree classifier
clf = tree.DecisionTreeClassifier(random_state=42)
print("Training the decision tree classifier...")
DTclf = clf.fit(X_train, y_train)

# Make predictions
test_prediction = DTclf.predict(X_test)
new_prediction = DTclf.predict([[135,29]])

print("Test set predictions:", test_prediction)
print("New sample prediction:", new_prediction)
Output

Training the decision tree classifier...
Test set predictions: ['Woman' 'Man' 'Man' 'Woman']
New sample prediction: ['Woman']
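The decision paths of a fitted tree can also be inspected as text. A minimal sketch using scikit-learn's export_text on a small toy subset (four rows borrowed from the data above) −

```python
from sklearn import tree
from sklearn.tree import export_text

# A small subset of the example data, enough to grow a tiny tree
X = [[16, 19], [141, 28], [13, 3], [186, 2]]
y = ['Man', 'Woman', 'Man', 'Woman']

clf = tree.DecisionTreeClassifier(random_state=42).fit(X, y)

# Print the learned split rules as indented text
print(export_text(clf, feature_names=['Feature_1', 'Feature_2']))
```

Each indented line shows one test on a feature, and each leaf line shows the predicted class, which makes the model's reasoning easy to audit.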

Key Parameters

Important parameters for tuning decision tree performance −

  • criterion − Split quality measure ('gini' or 'entropy')
  • max_depth − Maximum tree depth to prevent overfitting
  • min_samples_split − Minimum samples required to split a node
  • min_samples_leaf − Minimum samples required at leaf nodes
  • random_state − Controls randomness for reproducible results
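To see how these parameters curb overfitting, the sketch below compares an unrestricted tree with one capped at max_depth=3; the synthetic dataset from make_classification is an assumption used only for illustration −

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class problem (illustrative data, not the gender example)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")
```

An unrestricted tree typically memorizes the training set (100% training accuracy), while the capped tree trades a little training accuracy for a simpler model, the usual sign of reduced overfitting.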

Advantages and Limitations

Advantages:
  • Easy to interpret and visualize
  • No need for feature scaling
  • Handles both numerical and categorical data
  • Requires little data preparation

Limitations:
  • Prone to overfitting
  • Can be unstable with small data changes
  • Bias toward features with many levels
  • May create overly complex trees

Conclusion

Decision trees provide an intuitive approach to classification with clear decision paths. While they can overfit, proper parameter tuning and ensemble methods like Random Forest can significantly improve their performance and robustness.
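As a quick illustration of the ensemble idea, the sketch below swaps DecisionTreeClassifier for RandomForestClassifier on a few rows of the earlier toy data; the query point [135, 29] is the same hypothetical sample used above −

```python
from sklearn.ensemble import RandomForestClassifier

# A few rows reused from the example data above
X = [[16, 19], [141, 28], [13, 3], [186, 2], [126, 25], [17, 32]]
y = ['Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman']

# Averaging 100 randomized trees reduces the variance of any single tree
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(forest.predict([[135, 29]]))
```

Because each tree is trained on a bootstrap sample with random feature subsets, the forest is far less sensitive to small changes in the data than a single tree.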

Updated on: 2026-03-25T13:15:07+05:30
