CatBoost - Handling Categorical Features
Categorical features are variables that represent categories or labels rather than numerical values, so they are sometimes referred to as nominal or discrete features. These features are common in a wide range of real-world datasets and can be difficult to incorporate into machine learning models.
We can classify categorical features into two main types −
Nominal Categorical Features: These features represent categories with no inherent order or ranking, such as color, gender, or country. They typically need special encoding, such as one-hot or label encoding, to be used in most machine learning models.
Ordinal Categorical Features: These features represent categories with a meaningful order or ranking, such as education level with categories like high school, bachelor's degree, and master's degree. These features can be encoded as integers that preserve their order, as sketched below.
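For illustration, here is a minimal sketch of the difference; the column names and the order mapping are assumptions made for this example only −
import pandas as pd

# Hypothetical example: one ordinal and one nominal categorical feature
df = pd.DataFrame({
   'education': ['high school', 'bachelor', 'master', 'bachelor'],   # ordinal
   'color': ['red', 'blue', 'green', 'red']                          # nominal
})

# Ordinal feature: map categories to integers that preserve their order
education_order = {'high school': 0, 'bachelor': 1, 'master': 2}
df['education_level'] = df['education'].map(education_order)

# Nominal feature: leave 'color' as a string column; CatBoost can encode it
# internally when it is listed in cat_features during training
print(df)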
CatBoost Parameters
CatBoost is an advanced gradient boosting library that provides a large number of parameters for tuning and refining models. Let us have a look at some of the most commonly used CatBoost parameters; a short example of passing them to a model follows the list −
learning_rate: The step size used at each iteration as the loss function approaches its minimum. A smaller learning rate usually improves training quality but requires more iterations. The most common values range between 0.01 and 0.3.
iterations: The total number of trees in the ensemble, i.e. the number of boosting iterations. More iterations can boost model performance but increase the risk of overfitting. A typical range is a few hundred to a few thousand.
l2_leaf_reg: The coefficient of the L2 regularization term applied to leaf values. Penalizing large leaf weights helps to reduce overfitting, and adjusting this value controls the strength of the regularization.
depth: The depth of each tree in the ensemble. It defines the level of complexity of each tree. While deeper trees can describe complicated relationships, they are more prone to overfitting. Values often vary between 4 and 10.
verbose: When set to True, it displays the training progress during iterations. If False, it works silently and does not print progress.
random_seed: It is the seed used by the random number generator. Setting this value ensures the reproducibility of the results.
one_hot_max_size: The maximum number of distinct values a categorical feature may have for CatBoost to one-hot encode it. If the number of unique categories exceeds this limit, CatBoost handles the feature with its own statistics-based encoding instead.
cat_features: A list of the indices or names of the categorical features. CatBoost encodes these features internally and handles them differently during training.
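The short sketch below shows how these parameters might be passed when creating a classifier; the values chosen here are only illustrative and not recommendations from this chapter −
from catboost import CatBoostClassifier

# Illustrative values only; tune them for your own dataset
model = CatBoostClassifier(
   iterations=500,         # number of boosting iterations (trees)
   learning_rate=0.05,     # step size at each iteration
   depth=6,                # depth of each tree
   l2_leaf_reg=3.0,        # L2 regularization on leaf values
   one_hot_max_size=10,    # one-hot encode features with at most 10 unique categories
   random_seed=42,         # makes results reproducible
   verbose=False           # do not print per-iteration progress
)
The categorical feature indices themselves (cat_features) are usually passed to the Pool object or to fit(), as shown later in this chapter.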
CatBoost with Categorical Features
Unlike many other machine learning models, in which categorical variables need to be manually encoded, CatBoost can handle categorical features directly. You just have to tell it which features are categorical and CatBoost will handle the rest. Let us see how we can implement CatBoost with categorical features −
1. Install CatBoost Library
First, make sure that the CatBoost library is installed. If it is not, you can install it using the command below −
pip install catboost
Import the Libraries
After installing it, you are ready to use it in your model. Import the necessary libraries in your code as shown below −
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
2. Load and Prepare the Data
Make sure that your dataset contains categorical features. As discussed earlier, you do not need to encode them manually −
# Sample Data
data = [
   ['red', 1, 5.1],
   ['blue', 0, 3.5],
   ['green', 1, 4.7],
   ['blue', 0, 2.9],
   ['red', 1, 5.0]
]
3. Target Values and Split the Data
Now define the target values and convert the data into a DataFrame. After that, split the data into training and testing sets to train and evaluate the model −
# Target values
labels = [1, 0, 1, 0, 1]

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data, columns=['color', 'feature1', 'feature2'])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.2, random_state=42)
4. Identify Categorical Features
As we have seen earlier in this chapter, in CatBoost you have to tell the model which features are categorical. This can be done by passing the indices or names of the categorical features, as shown below −
# Declare categorical features (column index or name)
categorical_features = ['color']
5. Train the Model
Now use CatBoostClassifier or CatBoostRegressor for classification or regression tasks. Before training, create Pool objects to specify the data, labels, and categorical features. See the code below −
# Pool objects
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features)

# Initialize and train the model
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6)
model.fit(train_pool)

# Predict on the test set
preds = model.predict(test_pool)

# Model accuracy
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")
Output
Here is the result of the above model we have created for categorical features in CatBoost −
0:   learn: 0.6869753   total: 59.5ms   remaining: 5.89s
1:   learn: 0.6794071   total: 61.6ms   remaining: 3.02s
2:   learn: 0.6632128   total: 61.8ms   remaining: 2s
...
96:  learn: 0.2241489   total: 83.4ms   remaining: 2.58ms
97:  learn: 0.2228507   total: 83.5ms   remaining: 1.7ms
98:  learn: 0.2215656   total: 83.7ms   remaining: 845us
99:  learn: 0.2202937   total: 83.9ms   remaining: 0us
Accuracy: 1.0
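CatBoost also accepts the list of categorical features directly in fit(), so the Pool objects are optional. A minimal sketch, assuming the same X_train, y_train, X_test, and categorical_features defined above −
# Alternative: pass cat_features directly to fit() instead of building Pool objects
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)
model.fit(X_train, y_train, cat_features=categorical_features)
preds = model.predict(X_test)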