CatBoost - Classification Metrics



CatBoost is used mostly for classification tasks, where the goal is to place each data point into a category. In this chapter we evaluate how well CatBoost classifies data using several standard metrics.

Since the primary objective of a classification problem is to separate data points into distinct categories, CatBoost provides multiple metrics for judging how well a model achieves this.

The performance of a CatBoost classifier can be evaluated using the following key metrics −

Accuracy

Accuracy shows what percentage of the model's predictions were correct: the number of correct predictions divided by the total number of predictions. While this is the most intuitive measurement, it may not be the best choice for imbalanced datasets, where one class significantly outnumbers the other.

To find the accuracy we first import the required libraries: numpy, catboost, sklearn.datasets and sklearn.model_selection.

import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Loading the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier 
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object 
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model 
metrics = model.eval_metrics(test_pool, metrics=['Accuracy'], plot=True)

# Extract the final accuracy value
accuracy = metrics['Accuracy'][-1]

print(f'Accuracy: {accuracy:.2f}')

Output

The result shows that the model fits this dataset perfectly, correctly predicting every instance in the test set.

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
Accuracy: 1.00
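
As a sanity check, the same number can be computed directly from the model's predictions. Below is a minimal sketch, reusing the model, X_test and y_test defined above, that recomputes accuracy with scikit-learn's accuracy_score −

import numpy as np
from sklearn.metrics import accuracy_score

# Predicted class labels; ravel() flattens the (n, 1) output that a
# multiclass CatBoost model returns into a 1-D array
y_pred = np.ravel(model.predict(X_test))

print(f'Accuracy (sklearn): {accuracy_score(y_test, y_pred):.2f}')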

Multiclass Log Loss

Multiclass log loss, also known as cross-entropy for multiclass classification, is a variation of log loss designed for multiclass classification problems. The model predicts a probability distribution over the classes, and the metric measures how closely those predicted probabilities match the actual class labels.

import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Loading the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, loss_function='MultiClass', verbose=0)

# Create a Pool object
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model for multi-class classification
metrics = model.eval_metrics(test_pool, metrics=['MultiClass'], plot=True)

# Extract the final multi-class loss value
multi_class_loss = metrics['MultiClass'][-1]

print(f'Multi-Class Loss: {multi_class_loss:.2f}')

Output

In the result below, a multi-class loss value of 0.03 indicates that the model is performing well on multi-class classification over the test dataset.

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
Multi-Class Loss: 0.03
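
The same value can be recomputed from the predicted class probabilities. Here is a minimal sketch, reusing the model, X_test and y_test from above, that applies scikit-learn's log_loss to the output of predict_proba −

from sklearn.metrics import log_loss

# Probability distribution over the three classes for each test sample
proba = model.predict_proba(X_test)   # shape: (n_samples, n_classes)

print(f'Multi-Class Loss (sklearn): {log_loss(y_test, proba):.2f}')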

Binary Log Loss

Binary log loss measures the difference between the true labels and the predicted probabilities, and lower values indicate better performance. This metric is useful in situations that require well-calibrated probability estimates, such as fraud detection or medical diagnosis. It applies to binary classification, i.e. datasets with exactly two classes.

Because the Iris dataset has three classes, it is not suitable for this metric. Instead we use the Breast Cancer dataset, which has just two classes, representing the presence and absence of breast cancer.

import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Loading the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=0)

# Create a Pool object 
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model
metrics = model.eval_metrics(test_pool, metrics=['Logloss'], plot=False)

# Extract the final log loss value
logloss = metrics['Logloss'][-1]

print(f'Log Loss (Cross-Entropy): {logloss:.2f}')

Output

Here is the output of the above code −

Log Loss (Cross-Entropy): 0.08
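
To make the formula concrete, here is a minimal sketch, reusing the model, X_test and y_test from above, that computes the binary log loss by hand as -mean(y*log(p) + (1-y)*log(1-p)) −

import numpy as np

# Predicted probability of the positive class for each test sample
p = model.predict_proba(X_test)[:, 1]

# Clip the probabilities so log(0) can never occur
eps = 1e-15
p = np.clip(p, eps, 1 - eps)

manual_logloss = -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
print(f'Log Loss (manual): {manual_logloss:.2f}')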

AUC-ROC and AUC-PRC

Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and Area Under the Precision-Recall Curve (AUC-PRC) are critical metrics for binary classification. AUC-ROC evaluates the model's ability to distinguish between the positive and negative classes, whereas AUC-PRC focuses on the trade-off between precision and recall, which is especially informative on imbalanced data.

import catboost
from catboost import CatBoostClassifier, Pool
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Loading the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Converting to binary classification: class 2 (virginica) vs. the rest
y_binary = (y == 2).astype(int)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Creating a CatBoost classifier with AUC-ROC metric
model = CatBoostClassifier(iterations=500, random_seed=42, eval_metric='AUC')

# Converting the training data into a CatBoost Pool
train_pool = Pool(X_train, label=y_train)

# Training the model
model.fit(train_pool, verbose=100)

# Evaluate the model on the test set
validation_pool = Pool(X_test, label=y_test)
eval_result = model.eval_metrics(validation_pool, ['AUC'])['AUC']
metrics = model.eval_metrics(validation_pool, metrics=['PRAUC'], plot=True)
auc_pr = metrics['PRAUC'][-1]

# Print the evaluation metrics
print(f'AUC-PR: {auc_pr:.2f}')
print(f"AUC-ROC: {eval_result[-1]:.4f}")

Output

This will produce the following output −

Learning rate set to 0.007867
0:	total: 2.09ms	remaining: 1.04s
100:	total: 42.3ms	remaining: 167ms
200:	total: 67.9ms	remaining: 101ms
300:	total: 89.8ms	remaining: 59.4ms
400:	total: 110ms	remaining: 27ms
499:	total: 129ms	remaining: 0us
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
AUC-PR: 1.00
AUC-ROC: 1.0000
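
Both metrics can be cross-checked against scikit-learn. Here is a minimal sketch, reusing the model, X_test and y_test from above; note that average_precision_score is a closely related (but not always identical) summary of the precision-recall curve −

from sklearn.metrics import roc_auc_score, average_precision_score

# Predicted probability of the positive class for each test sample
scores = model.predict_proba(X_test)[:, 1]

print(f"AUC-ROC (sklearn): {roc_auc_score(y_test, scores):.4f}")
print(f"AUC-PR (sklearn): {average_precision_score(y_test, scores):.2f}")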

F1 Score

The F1 Score is the harmonic mean of precision (how many of the model's positive predictions were correct) and recall (how many of the actual positives the model identified). This metric is ideal when you need to balance the trade-off between false positives and false negatives, and better models tend to have higher F1 Scores.

import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Loading the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=0)

# Creating a Pool object 
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Train the model
model.fit(train_pool)

# Evaluate the model 
metrics = model.eval_metrics(test_pool, metrics=['F1'], plot=True)

# Extract the final F1 score
f1 = metrics['F1'][-1]

print(f'F1 Score: {f1:.2f}')

Output

This will produce the following output −

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
F1 Score: 0.98
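
Because F1 is the harmonic mean of precision and recall, it can be rebuilt from those two quantities. Below is a minimal sketch, reusing the model, X_test and y_test from above −

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Predicted class labels, flattened to a 1-D array
y_pred = np.ravel(model.predict(X_test))

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
print(f'F1 (harmonic mean): {2 * precision * recall / (precision + recall):.2f}')
print(f'F1 (sklearn): {f1_score(y_test, y_pred):.2f}')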

Summary

In summary, CatBoost provides a wide array of metrics and evaluation tools that greatly simplify the model selection and evaluation process. It offers built-in evaluation metrics for classification tasks, such as Accuracy, Logloss, AUC and F1, and it also allows customisation with user-defined metrics. The ability to use early stopping, cross-validation, and to monitor many metrics throughout training ensures a thorough evaluation of any classification model.
