CatBoost - Core Parameters

CatBoost is a powerful machine learning library designed for classification and regression tasks. You can tune its core parameters to fit your dataset and the specific problem you are working on. Using these parameters well helps you improve your model's accuracy, avoid overfitting, and speed up training.

Core Parameters

Core parameters in CatBoost are the main configurations that have a major effect on the behavior and performance of your model. Among other things, these parameters control the number of training iterations, the learning rate, the tree depth, and the loss function used during training.

There are many parameters you can control while using CatBoost. Here are some core parameters listed below −

Key Parameters

Strictly speaking, the values a model learns during training, such as the split points and leaf values in a decision tree, are its parameters, while the settings you choose before training begins are hyperparameters. In practice, CatBoost's API simply calls these settings parameters, and you can shape the training process by modifying them. Let us look at some important CatBoost parameters and their functions −

  • iterations: The number of boosting iterations to run. A new tree is added to the ensemble with every iteration.

  • learning_rate: The step size applied at each iteration. A lower learning rate gives more stable but slower convergence, so it usually needs more iterations.

  • depth: The maximum depth of each tree. A deeper tree can capture more complex relationships but can also overfit.

  • loss_function: The loss function used to evaluate how well the model fits the data during training. Common choices are RMSE for regression, MultiClass for multi-class classification, and Logloss for binary classification.

  • eval_metric: The metric used to measure the model's performance during training.

  • random_seed: A fixed random seed that ensures reproducible results.
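
To see how these key parameters fit together, here is a minimal sketch of a CatBoostRegressor configured with each of them; the values are illustrative placeholders, not tuned choices −

# A minimal sketch of the key parameters on a regressor
from catboost import CatBoostRegressor

model = CatBoostRegressor(
   # Number of boosting iterations (trees)
   iterations=500,
   # Step size applied at each iteration
   learning_rate=0.05,
   # Maximum depth of each tree
   depth=6,
   # RMSE is a common loss for regression
   loss_function='RMSE',
   # Metric reported during training
   eval_metric='RMSE',
   # Fixed seed for reproducible results
   random_seed=42
)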

Common Parameters

Here are the common parameters, mainly used in the Python package, R package and command-line version −

1. loss_function:

It mainly decides what kind of problem you are solving, for example classification or regression, and which objective to optimize. Set it to values like 'Logloss' for binary classification or 'RMSE' for regression.

Command Line −

--loss-function
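
For instance, the kind of estimator and the loss go hand in hand. Here is a minimal sketch of both cases; the values are placeholders −

from catboost import CatBoostClassifier, CatBoostRegressor

# Binary classification: optimize Logloss
clf = CatBoostClassifier(loss_function='Logloss')

# Regression: optimize RMSE
reg = CatBoostRegressor(loss_function='RMSE')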

2. custom_metric:

Using custom_metric you can monitor extra metrics during training. These metrics are reported for your information only; they are tracked alongside training but do not influence how the model is optimized.

Command Line −

--custom-metric
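
As a sketch, you can track several metrics at once and read them back after training with get_evals_result(); the metric names below are just examples −

from catboost import CatBoostClassifier

model = CatBoostClassifier(
   # Loss that is actually optimized
   loss_function='Logloss',
   # Extra metrics tracked for information only
   custom_metric=['AUC', 'Precision'],
   verbose=False
)
# After fitting with an eval_set, the tracked values can be
# read back with model.get_evals_result()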

3. eval_metric

It is used to check how the model performs during training and to detect overfitting, and it is the metric used to pick the best model. Choose a metric that fits your problem, like 'Accuracy' for classification.

Command Line −

--eval-metric
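
A common pattern is to pair eval_metric with a validation set and use_best_model, so the iteration with the best metric value is kept. A minimal sketch, where X_train, y_train, X_val and y_val stand in for your own data split −

from catboost import CatBoostClassifier

model = CatBoostClassifier(
   # Metric used to detect overfitting and pick the best model
   eval_metric='Accuracy',
   # Keep the iteration with the best eval_metric value
   use_best_model=True,
   verbose=False
)
# use_best_model requires a validation set to evaluate against:
# model.fit(X_train, y_train, eval_set=(X_val, y_val))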

4. iterations:

This parameter sets the number of trees (iterations) CatBoost will build. More iterations can improve accuracy but may increase training time.

Command Line −

-i, --iterations

5. learning_rate:

The learning rate controls how fast or slow the model learns. A smaller value results in better accuracy but requires more iterations.

Command Line −

-w, --learning-rate
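
These two parameters trade off against each other: fewer trees with a larger step, or more trees with a smaller step. A rough sketch of both configurations; the exact numbers are illustrative −

from catboost import CatBoostClassifier

# Fewer trees, larger steps: trains faster, fits more coarsely
fast_model = CatBoostClassifier(iterations=200, learning_rate=0.1)

# More trees, smaller steps: trains slower, often generalizes better
slow_model = CatBoostClassifier(iterations=2000, learning_rate=0.01)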

6. random_seed:

It ensures the same results every time you train the model by fixing the random seed value.

Command Line −

-r, --random-seed

7. l2_leaf_reg:

This parameter adds L2 regularization to the leaf values to prevent overfitting. Increasing it makes the model more conservative and can help reduce overfitting.

Command Line −

--l2-leaf-reg
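
One way to choose this value is CatBoost's built-in grid_search method, which tries candidates with cross-validation and refits the model with the best one. A minimal sketch on synthetic data; the candidate values are arbitrary −

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)

model = CatBoostClassifier(iterations=100, verbose=False)
# Search over several regularization strengths
result = model.grid_search(
   {'l2_leaf_reg': [1, 3, 5, 10]},
   X, y, verbose=False
)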

8. bootstrap_type:

Defines the method for sampling the weights of objects during training. Options include 'Bayesian', 'Bernoulli' and 'MVS'.

Command Line −

--bootstrap-type

9. bagging_temperature:

This adjusts the amount of randomness in data sampling during training and only takes effect when bootstrap_type is 'Bayesian'. A higher value adds more randomness.

Command Line −

--bagging-temperature

10. subsample:

This parameter controls the fraction of data used for training each tree. A value below 1 uses only part of the data; it applies to bootstrap types such as 'Bernoulli' and 'MVS'.

Command Line −

--subsample
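
Parameters 8-10 work together: bagging_temperature only applies with the 'Bayesian' bootstrap, while subsample applies with types such as 'Bernoulli' or 'MVS'. A sketch of the two combinations −

from catboost import CatBoostClassifier

# Bayesian bootstrap: randomness controlled by the temperature
bayes_model = CatBoostClassifier(
   bootstrap_type='Bayesian',
   bagging_temperature=1.0
)

# Bernoulli bootstrap: each tree sees 80% of the data
bernoulli_model = CatBoostClassifier(
   bootstrap_type='Bernoulli',
   subsample=0.8
)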

11. depth:

The depth of the tree determines how complex the model is. Deeper trees can model more complex patterns but may lead to overfitting.

Command Line −

-n, --depth

12. grow_policy:

Defines the strategy for growing trees. Possible values are 'SymmetricTree' (the default), 'Depthwise' and 'Lossguide'.

Command Line −

--grow-policy

13. min_data_in_leaf:

This sets the minimum number of data points that must be present in a leaf, which helps avoid overfitting by preventing splits on small samples. It can only be used with the 'Depthwise' and 'Lossguide' growing policies.

Command Line −

--min-data-in-leaf

14. max_leaves:

This parameter controls the maximum number of leaves in a tree. It can only be used with the 'Lossguide' growing policy.

Command Line −

--max-leaves
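
Parameters 11-14 all shape the tree structure, and some only apply under certain growing policies, as noted above. A sketch of a 'Lossguide' configuration that combines them; the numbers are illustrative −

from catboost import CatBoostClassifier

model = CatBoostClassifier(
   # Grow leaves by best loss improvement, not level by level
   grow_policy='Lossguide',
   # Upper bound on leaves per tree (Lossguide only)
   max_leaves=31,
   # Do not split nodes holding fewer than 10 samples
   min_data_in_leaf=10,
   # Maximum tree depth still applies
   depth=8
)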

15. ignored_features:

You can exclude certain features from the model by specifying their indices or names. This is useful if some features are not relevant.

Command Line −

-I, --ignore-features
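
A minimal sketch; the feature indices here are arbitrary, and names can be used instead when the training data carries column names −

from catboost import CatBoostClassifier

# Skip the first and fourth columns entirely during training
model = CatBoostClassifier(ignored_features=[0, 3])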

16. one_hot_max_size:

This parameter applies one-hot encoding to categorical features with a small number of unique values (below the specified limit).

Command Line −

--one-hot-max-size
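
This only matters for columns declared as categorical via cat_features. A minimal sketch on made-up data: 'city' has few unique values, so with the limit below it is one-hot encoded rather than handled with CatBoost's target statistics −

import pandas as pd
from catboost import CatBoostClassifier

# Toy data with one categorical column (illustrative)
df = pd.DataFrame({
   'city': ['London', 'Paris', 'London', 'Tokyo'],
   'age': [25, 32, 47, 51]
})
y = [0, 1, 0, 1]

model = CatBoostClassifier(
   # One-hot encode categorical features with <= 5 unique values
   one_hot_max_size=5,
   iterations=10,
   verbose=False
)
model.fit(df, y, cat_features=['city'])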

17. class_weights:

This parameter allows you to assign different weights to different classes, especially useful when the data is imbalanced (one class has far fewer examples).

Command Line −

--class-weights
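
A minimal sketch for a binary problem where class 1 is rare, so its errors are penalized five times more heavily; the weights are illustrative. Recent CatBoost versions also offer auto_class_weights='Balanced' to compute weights automatically −

from catboost import CatBoostClassifier

model = CatBoostClassifier(
   # Weight for class 0, then class 1:
   # mistakes on the rare class cost five times more
   class_weights=[1.0, 5.0]
)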

CatBoost Example using Common Parameters

Here is an example of how to build a CatBoost model using a few of the commonly used parameters. This Python code example shows how to use these parameters −

# Import the necessary libraries
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
data = load_iris()
X = data['data']
y = data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier with common parameters
model = CatBoostClassifier(
   # Number of trees 
   iterations=100,              
   # Learning rate
   learning_rate=0.1,           
   # Depth of the trees
   depth=6,                     
   # Loss function for multi-class classification
   loss_function='MultiClass',  
   # Metric for evaluating performance
   eval_metric='Accuracy',      
   # Random seed for reproducibility
   random_seed=42,              
   # L2 regularization to prevent overfitting
   l2_leaf_reg=3.0,             
   # Bootstrap method for bagging
   bootstrap_type='Bernoulli',   
   # Silent mode, no training output
   verbose=False                 
)

# Train the model
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

Output

This will generate the following result −

Accuracy: 1.0000