CatBoost - Over-fitting Detection



Over-fitting is the term used to describe a model that performs well on training data but poorly on unknown data. CatBoost provides metrics to evaluate over-fitting.

Here are some common CatBoost metrics to explain over-fitting protection −

Cross-Validation

Cross-validation is one of the most significant machine learning methods for finding and lower over-fitting. By comparing training and validation performance cross validation helps in the detection of over-fitting.

This method analyses a CatBoostClassifier performance of the model using cross validation on the Iris dataset. Using cross validation is a common method to stop overfitting which is a situation where a model performs on new data because it is overtuned to training data. Furthermore, cross-validation improves our understanding of the real performance of the model. To determine the performance level of the model it determines a number of measures.

# Import required libraries
import numpy as np
from catboost import CatBoostClassifier, Pool, cv
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the required dataset
iris = load_iris()
X, y = iris.data, iris.target

# Generate a CatBoost Pool object
data_pool = Pool(X, label=y)

# Defining the CatBoostClassifier parameters
params = {
	'iterations': 100,		
	'learning_rate': 0.1,	
	'depth': 6,				
	'loss_function': 'MultiClass', 
	'verbose': 0			 
}

# Do cross-validation
cv_results = cv(pool=data_pool, 
				params=params, 
				fold_count=5, 
				shuffle=True, 
				partition_random_seed=42, 
				verbose_eval=False)

# Print the results
for metric_name in cv_results.columns:
	if 'test-' in metric_name:
		mean_score = cv_results[metric_name].iloc[-1]
		print(f'{metric_name}: {mean_score:.4f}')

Output

This will produce the following result −

Training on fold [0/5]

bestTest = 0.1226007055
bestIteration = 72

Training on fold [1/5]

bestTest = 0.09388296402
bestIteration = 99

Training on fold [2/5]

bestTest = 0.05707644554
bestIteration = 99

Training on fold [3/5]

bestTest = 0.1341533772
bestIteration = 93

Training on fold [4/5]

bestTest = 0.19934632
bestIteration = 94

test-MultiClass-mean: 0.1221
test-MultiClass-std: 0.0531

Feature Importance

CatBoost also offers a feature importance score. It can be used for determining the features importance and usefulness as well as how they affect the prediction made by the model. Using feature importance over-fitting can be detected by highlighting variables that have an impact on the predictions of the model. If a feature has no impact on the model's predictions it can become over-fit to the model.

The resulting bar plot will highlight the importance of each feature in the predictions of the model. The features that are most important for the classification process can be identified by with the help of this data to rank engineering or feature selection jobs.

import numpy as np
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris_dataset = load_iris()
X_data, y_labels = iris_dataset.data, iris_dataset.target

# Split the data into training and testing sets
X_train_data, X_test_data, y_train_labels, y_test_labels = train_test_split(X_data, y_labels, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
catboost_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, loss_function='MultiClass', verbose=0)

# Train the model
catboost_model.fit(X_train_data, y_train_labels)

# Create a Pool object for the testing data
test_data_pool = Pool(X_test_data)

# Get feature importance scores
feature_importance_scores = catboost_model.get_feature_importance(test_data_pool)

# Get feature names
feature_labels = iris_dataset.feature_names

# Plot feature importance with changed color
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_importance_scores)), feature_importance_scores, color='skyblue', tick_label=feature_labels)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Feature Importance for CatBoost Classifier')
plt.show()

Output

Here is the result produced using the above code −

CatBoost Feature Importance

Identifying over-fitting also depends largely on the learning curve. There is not a different learning curve graphing technique in CatBoost. However, it can be plotted using other Python programs like matplotlib.

Advertisements