
CatBoost - Core Parameters
CatBoost is a very useful machine learning library designed for classification and regression applications. You can tune its core parameters to fit your dataset and the specific problem you are working on. Using these parameters well helps you improve your model's accuracy, avoid overfitting, and speed up training.
Core Parameters
Core parameters in CatBoost are the main configurations that have a major effect on the behavior and performance of your model. Among other things, these parameters control the number of training iterations, the learning rate, the tree depth, and the loss function used in the overall training process.
There are many parameters you can control while using CatBoost. Here are some core parameters listed below −
Key Parameters
Strictly speaking, the values a model learns during training, such as the split points and leaf values of a decision tree, are its parameters; the settings below are configurations you choose before training to control that process. Let us look at some important CatBoost parameters and their functions (a short sketch putting them together follows this list) −
iterations: The number of boosting iterations. A new tree is added to the ensemble with every iteration.
learning_rate: The step size applied at each iteration. A lower learning rate means more stable but possibly slower convergence.
depth: The maximum depth of each tree. A deeper tree can capture more complex interactions but can also overfit.
loss_function: The loss function used to evaluate how well the model performs during training. Common choices are RMSE for regression, Logloss for binary classification, and MultiClass for multi-class classification.
eval_metric: The metric used to evaluate model performance during training.
random_seed: A fixed random seed to ensure reproducible results.
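The sketch below puts these core parameters together using the Python package. The dataset is synthetic and the values are illustrative assumptions, not tuned choices −

# Minimal sketch of the core parameters (synthetic data, illustrative values)
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)               # 100 samples, 4 numeric features
y = rng.randint(0, 2, 100)         # binary target

model = CatBoostClassifier(
    iterations=200,                # number of boosting rounds (trees)
    learning_rate=0.05,            # step size of each boosting round
    depth=6,                       # maximum tree depth
    loss_function='Logloss',       # binary classification loss
    eval_metric='AUC',             # metric reported during training
    random_seed=42,                # fixed seed for reproducibility
    verbose=False                  # suppress per-iteration output
)
model.fit(X, y)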
Common Parameters
Here are the common parameters, which apply to the Python package, the R package and the command-line version −
1. loss_function:
This parameter decides what kind of problem you are solving (for example, classification or regression) and which objective to optimize. Set it to values like 'Logloss' for classification or 'RMSE' for regression.
Command Line −
--loss-function
2. custom_metric:
Using custom_metric you can monitor extra metrics during training. These metrics are reported for your information only and are not used to optimize the model. List the metrics you want to track; they do not affect the training itself.
Command Line −
--custom-metric
3. eval_metric:
It is used to check how the model performs during training and to detect overfitting. This metric also helps to pick the best model. Choose a metric that fits your problem, such as 'Accuracy' for classification.
Command Line −
--eval-metric
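Putting the three metric-related parameters above together in the Python package, a minimal sketch might look like this (the metric names are illustrative choices, not recommendations) −

# Sketch: loss_function is optimized, custom_metric is only reported,
# eval_metric is used to judge and select the model.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    loss_function='Logloss',             # objective the model optimizes
    custom_metric=['AUC', 'Precision'],  # tracked for information only
    eval_metric='Accuracy',              # used for evaluation and model selection
    verbose=False
)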
4. iterations:
This parameter sets the number of trees (iterations) CatBoost will build. More iterations can improve accuracy but may increase training time.
Command Line −
-i, --iterations
5. learning_rate:
The learning rate controls how fast or slow the model learns. A smaller value results in better accuracy but requires more iterations.
Command Line −
-w, --learning-rate
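As a rough rule of thumb, lowering the learning rate calls for more iterations to reach a similar fit. A minimal sketch of that trade-off (the exact values are assumptions) −

# Two configurations trading learning rate against iteration count
from catboost import CatBoostClassifier

fast = CatBoostClassifier(iterations=100, learning_rate=0.1, verbose=False)
slow = CatBoostClassifier(iterations=400, learning_rate=0.025, verbose=False)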
6. random_seed:
It ensures the same results every time you train the model by fixing the random seed value.
Command Line −
-r, --random-seed
7. l2_leaf_reg:
This parameter controls the strength of L2 regularization on the leaf values. Increasing it can help reduce overfitting.
Command Line −
--l2-leaf-reg
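A minimal sketch of raising the regularization strength (the default value of l2_leaf_reg is 3.0; the 10.0 below is an illustrative choice) −

# Stronger L2 penalty on leaf values to damp overfitting
from catboost import CatBoostClassifier

model = CatBoostClassifier(l2_leaf_reg=10.0, verbose=False)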
8. bootstrap_type:
Defines the method for sampling the weights of objects during training. Options include 'Bayesian', 'Bernoulli', etc.
Command Line −
--bootstrap-type
9. bagging_temperature:
This adjusts the amount of randomness in data sampling during training. A higher value adds more randomness.
Command Line −
--bagging-temperature
10. subsample:
This parameter controls the percentage of data used for training each tree. A value below 1 uses only a fraction of the data.
Command Line −
--subsample
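The three sampling-related parameters above interact: bagging_temperature applies only to the 'Bayesian' bootstrap, while subsample applies to types such as 'Bernoulli'. A sketch of two valid combinations (the values are assumptions) −

# Bayesian bootstrap: randomness controlled via bagging_temperature
from catboost import CatBoostClassifier

bayes = CatBoostClassifier(
    bootstrap_type='Bayesian',
    bagging_temperature=1.0,    # higher means more random object weights
    verbose=False
)

# Bernoulli bootstrap: each tree samples a fraction of the rows
bern = CatBoostClassifier(
    bootstrap_type='Bernoulli',
    subsample=0.8,              # each tree sees ~80% of the data
    verbose=False
)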
11. depth:
The depth of the tree determines how complex the model is. Deeper trees can model more complex patterns but may lead to overfitting.
Command Line −
-n, --depth
12. grow_policy:
Defines the strategy for growing trees. Different policies can be chosen based on the problem and dataset.
Command Line −
--grow-policy
13. min_data_in_leaf:
This sets the minimum number of data points that must be present in a leaf. It helps avoid overfitting by preventing splits on small samples.
Command Line −
--min-data-in-leaf
14. max_leaves:
This parameter controls the maximum number of leaves in a tree. It is only used with specific tree-growing policies.
Command Line −
--max-leaves
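The tree-growing parameters above also interact: max_leaves is only valid with the 'Lossguide' policy, and min_data_in_leaf is not supported by the default 'SymmetricTree' policy. A minimal sketch under those constraints (the values are assumptions) −

# Leaf-wise growth with caps on leaf count and leaf size
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    grow_policy='Lossguide',    # grow leaf-by-leaf instead of level-by-level
    max_leaves=31,              # cap on leaves per tree
    min_data_in_leaf=10,        # forbid splits that leave fewer than 10 rows
    verbose=False
)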
15. ignored_features:
You can exclude certain features from the model by specifying their indices or names. This is useful if some features are not relevant.
Command Line −
-I, --ignore-features
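A short sketch of excluding features by index (the indices here are illustrative) −

# Columns 0 and 2 will be ignored during training
from catboost import CatBoostClassifier

model = CatBoostClassifier(ignored_features=[0, 2], verbose=False)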
16. one_hot_max_size:
This parameter applies one-hot encoding to categorical features with a small number of unique values (below the specified limit).
Command Line −
--one-hot-max-size
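A minimal sketch on a tiny dataset with one categorical column (the data and the limit of 10 are illustrative assumptions) −

# Column 0 is categorical; with at most 10 unique values it is one-hot encoded
from catboost import CatBoostClassifier, Pool

X = [['red', 1.0], ['blue', 2.0], ['green', 3.0], ['red', 4.0],
     ['blue', 5.0], ['green', 6.0], ['red', 7.0], ['blue', 8.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

train = Pool(data=X, label=y, cat_features=[0])

model = CatBoostClassifier(one_hot_max_size=10, iterations=10, verbose=False)
model.fit(train)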
17. class_weights:
This parameter allows you to assign different weights to different classes, especially useful when the data is imbalanced (one class has far fewer examples).
Command Line −
--class-weights
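For an imbalanced binary problem, a minimal sketch would weight the rare class higher (the 1:5 ratio is an assumption, not a recommendation) −

# Errors on class 1 count five times as much as errors on class 0
from catboost import CatBoostClassifier

model = CatBoostClassifier(class_weights=[1.0, 5.0], verbose=False)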
CatBoost Example using Common Parameters
Here is an example of how to build a CatBoost model using a few of the commonly used parameters. This Python code example shows how to use these parameters −
# Import the necessary libraries
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
data = load_iris()
X = data['data']
y = data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier with common parameters
model = CatBoostClassifier(
    iterations=100,              # Number of trees
    learning_rate=0.1,           # Learning rate
    depth=6,                     # Depth of the trees
    loss_function='MultiClass',  # Loss function for multi-class classification
    eval_metric='Accuracy',      # Metric for evaluating performance
    random_seed=42,              # Random seed for reproducibility
    l2_leaf_reg=3.0,             # L2 regularization to prevent overfitting
    bootstrap_type='Bernoulli',  # Bootstrap method for bagging
    verbose=False                # Silent mode, no training output
)

# Train the model
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
Output
This will generate the following result −
Accuracy: 1.0000