CatBoost - Handling Categorical Features



Categorical features are variables that represent categories or labels rather than numerical values; they are sometimes referred to as nominal or discrete features. These features are common in a wide range of real-world datasets and can be difficult to include in machine learning models.

We can classify categorical features into two main types −

  • Nominal Categorical Features: These features represent categories with no inherent order or ranking, like color, gender, and country. These features typically need special encoding, such as one-hot or label encoding, before they can be used in most machine learning models.

  • Ordinal Categorical Features: These features represent categories having a meaningful order or ranking, like education level with categories such as high school, bachelor's degree, master's degree, and so on. These features can be mapped to integers that preserve their order, as the sketch below illustrates.
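
Before turning to CatBoost itself, here is a minimal pandas sketch of both encodings; the column names and category levels are made up for illustration −

import pandas as pd

df = pd.DataFrame({
   'color': ['red', 'blue', 'green'],                   # nominal: no order
   'education': ['high school', 'bachelor', 'master']   # ordinal: ordered
})

# Nominal feature: one-hot encoding creates one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# Ordinal feature: map each level to an integer that preserves the order
order = {'high school': 0, 'bachelor': 1, 'master': 2}
df['education_encoded'] = df['education'].map(order)

print(one_hot)
print(df[['education', 'education_encoded']])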

CatBoost Parameters

CatBoost is an advanced gradient boosting library that offers a large number of parameters for tuning and refining models. Let us have a look at some of the most commonly used CatBoost parameters −

  • learning_rate: The step size applied at each iteration as the loss function approaches its minimum. A smaller learning rate usually improves the quality of training but needs more iterations. The most common values range between 0.01 and 0.3.

  • iterations: The total number of trees in the ensemble, i.e., the number of boosting iterations. More iterations can boost model performance but also increase the possibility of overfitting. A few hundred to a few thousand is a typical range.

  • l2_leaf_reg: The coefficient of the L2 regularization term applied to leaf values. Penalizing large leaf weights helps to reduce overfitting; increasing this value strengthens the regularization.

  • depth: The depth of each tree in the ensemble. It defines the level of complexity of each tree. While deeper trees can describe complicated relationships, they are more prone to overfitting. Values often vary between 4 and 10.

  • verbose: When set to True, the training progress is displayed during iterations. When set to False, training runs silently without printing progress.

  • random_seed: It is the seed used by the random number generator. Setting this value ensures the reproducibility of the results.

  • one_hot_max_size: The maximum number of distinct categories for which CatBoost applies one-hot encoding to a categorical feature. If the number of unique categories exceeds this limit, CatBoost falls back to its statistics-based encoding to handle the feature.

  • cat_features: A list of the indices or names of the categorical features. CatBoost encodes these features internally and handles them differently during training; see the sketch after this list.
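
To make the parameters above concrete, here is a minimal sketch of how they are typically passed to the classifier; the specific values are illustrative, not tuned recommendations −

from catboost import CatBoostClassifier

model = CatBoostClassifier(
   iterations=500,        # number of boosting iterations (trees)
   learning_rate=0.05,    # step size per iteration
   depth=6,               # depth of each tree
   l2_leaf_reg=3.0,       # L2 regularization on leaf values
   one_hot_max_size=10,   # one-hot encode features with <= 10 categories
   random_seed=42,        # for reproducible results
   verbose=False          # train silently
)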

CatBoost with Categorical Features

In many other machine learning models, categorical variables or features need to be manually encoded, but CatBoost can handle categorical features directly. You just have to tell it which features are categorical and CatBoost will handle them. So let us see how we can implement CatBoost with categorical features −

1. Install CatBoost Library

First, make sure that the CatBoost library is installed. If it is not installed, you can install it using the below command −

pip install catboost

2. Import the Libraries

After installing it, you are ready to use it in your model. So import the necessary libraries in your code as shown below −

from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

3. Load and Prepare the Data

Make sure that your dataset contains categorical features. As we have discussed earlier, you do not need to encode them manually.

# Sample Data
data = [
   ['red', 1, 5.1],
   ['blue', 0, 3.5],
   ['green', 1, 4.7],
   ['blue', 0, 2.9],
   ['red', 1, 5.0]
]

4. Define Target Values and Split the Data

Now define the target values and convert the data into a DataFrame. After that, split the data into training and testing sets to train and evaluate the model.

# Target values
labels = [1, 0, 1, 0, 1]

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data, columns=['color', 'feature1', 'feature2'])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.2, random_state=42)

5. Identify Categorical Features

As we have seen earlier in this chapter, in CatBoost you have to tell the model which features are categorical. This is done by passing the indices or names of the categorical features, as shown below −

# Declare categorical features (column index or name)
categorical_features = ['color']

6. Train the Model

Now use the CatBoostClassifier or CatBoostRegressor of CatBoost for classification or regression tasks respectively. Before training, create Pool objects to specify the data, labels, and categorical features. See the code below −

# Pool object
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features)

# Initialize and train the model
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6)
model.fit(train_pool)

# Predict on the test set
preds = model.predict(test_pool)

# Model accuracy
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")

Output

Here is the output of the above model we have created for categorical features in CatBoost −

0:	learn: 0.6869753	total: 59.5ms	remaining: 5.89s
1:	learn: 0.6794071	total: 61.6ms	remaining: 3.02s
2:	learn: 0.6632128	total: 61.8ms	remaining: 2s
. 
. 
. 
96:	learn: 0.2241489	total: 83.4ms	remaining: 2.58ms
97:	learn: 0.2228507	total: 83.5ms	remaining: 1.7ms
98:	learn: 0.2215656	total: 83.7ms	remaining: 845us
99:	learn: 0.2202937	total: 83.9ms	remaining: 0us
Accuracy: 1.0
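
As a closing note, a trained CatBoost model can also score new rows that still contain the raw category strings. The following brief sketch reuses the model, Pool, and categorical_features defined above; the new sample values are made up for illustration −

# New, unseen sample with the raw categorical value left as-is
new_sample = pd.DataFrame([['green', 1, 4.9]],
                          columns=['color', 'feature1', 'feature2'])

new_pool = Pool(data=new_sample, cat_features=categorical_features)
print(model.predict(new_pool))         # predicted class label
print(model.predict_proba(new_pool))   # class probabilities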