
CatBoost - Ranker
CatBoost Ranker is a ranking model that is part of the CatBoost library and is designed specifically for ranking tasks. Ranking tasks, which are common in search engines and recommendation systems, involve placing items in a specific order according to their relevance or importance.
How Does CatBoost Ranker Work?
The model's goal is to predict the relative order of items, for example ranking a list of search results by how relevant they are to the user's query. CatBoost uses gradient boosting to build decision trees sequentially, where every tree tries to correct the mistakes made by its predecessors.
Because CatBoost handles categorical features natively and efficiently, it is also a particularly useful tool for ranking tasks whose data contains many categorical variables.
Key Features
Here are some key features of CatBoost Ranker −
Compared to many other models, CatBoost is easier to use with categorical features because it handles categorical data natively, which avoids the need for extensive preprocessing (a short sketch follows this list).
Its speed and out-of-the-box accuracy make the CatBoost library a strong choice for ranking tasks.
It includes built-in overfitting detection and can work well even with imbalanced datasets.
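To illustrate the first point, here is a minimal, hedged sketch with made-up toy data and column names, showing how raw string categories can be passed straight into a ranking Pool through cat_features −
from catboost import CatBoostRanker, Pool
import pandas as pd

# Toy data: two queries (group ids), each with a raw string "city" feature.
df = pd.DataFrame({
    "price": [120, 80, 95, 60],
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],   # left unencoded on purpose
    "relevance": [2, 1, 1, 0],
    "query_id": [0, 0, 1, 1],                          # rows of one query are contiguous
})

# Declaring cat_features lets CatBoost handle the strings itself,
# so no label encoding or one-hot encoding is needed.
train_pool = Pool(
    data=df[["price", "city"]],
    label=df["relevance"],
    group_id=df["query_id"],
    cat_features=["city"],
)

model = CatBoostRanker(iterations=10, verbose=False)
model.fit(train_pool)
print(model.predict(train_pool))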
How to Use CatBoost Ranker
You can use the CatBoostRanker class from the CatBoost library. Here is a small example of how to set it up −
from catboost import CatBoostRanker, Pool

# Define training and testing datasets
train_data = Pool(data=X_train, label=y_train, group_id=train_group)
test_data = Pool(data=X_test, label=y_test, group_id=test_group)

# Initialize the ranker model
model = CatBoostRanker(iterations=1000, depth=6, learning_rate=0.1)

# Train the model
model.fit(train_data)

# Make predictions
predictions = model.predict(test_data)
In the above example, X_train and X_test are the feature matrices, y_train and y_test are the relevance scores (labels), and group_id indicates which items belong to the same query (for example, all search results returned for one particular query). Note that CatBoost expects all rows of a group to be stored contiguously in the Pool. A small sketch of how these arrays line up is shown below.
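For instance, here is a hedged toy sketch (the numbers and feature values are invented) of how the three arrays relate to each other for two queries −
import numpy as np

# Two queries: query 0 has three candidate documents, query 1 has two.
X_train = np.array([
    [0.2, 1.0],    # document A, belongs to query 0
    [0.5, 0.0],    # document B, belongs to query 0
    [0.1, 3.0],    # document C, belongs to query 0
    [0.9, 2.0],    # document D, belongs to query 1
    [0.4, 1.0],    # document E, belongs to query 1
])
y_train = np.array([2, 0, 1, 1, 0])        # graded relevance label for each document
train_group = np.array([0, 0, 0, 1, 1])    # query id per row; rows of one query are contiguous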
Syntax of CatBoostRanker
Here is the syntax for the CatBoostRanker class −
class CatBoostRanker(
    iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None,
    model_size_reg=None, rsm=None, loss_function='YetiRank', border_count=None,
    feature_border_type=None, per_float_feature_quantization=None,
    input_borders=None, output_borders=None, fold_permutation_block=None,
    od_pval=None, od_wait=None, od_type=None, nan_mode=None,
    counter_calc_method=None, leaf_estimation_iterations=None,
    leaf_estimation_method=None, thread_count=None, random_seed=None,
    use_best_model=None, best_model_min_trees=None, verbose=None, silent=None,
    logging_level=None, metric_period=None, ctr_leaf_count_limit=None,
    store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None,
    allow_const_label=None, target_border=None, one_hot_max_size=None,
    random_strength=None, name=None, ignored_features=None, train_dir=None,
    custom_metric=None, eval_metric=None, bagging_temperature=None,
    save_snapshot=None, snapshot_file=None, snapshot_interval=None,
    fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None,
    pinned_memory_size=None, allow_writing_files=None,
    final_ctr_computation_mode=None, approx_on_full_history=None,
    boosting_type=None, simple_ctr=None, combinations_ctr=None,
    per_feature_ctr=None, ctr_description=None, ctr_target_border_count=None,
    task_type=None, device_config=None, devices=None, bootstrap_type=None,
    subsample=None, mvs_reg=None, sampling_frequency=None, sampling_unit=None,
    dev_score_calc_obj_block_size=None, dev_efb_max_buckets=None,
    sparse_features_conflict_fraction=None, max_depth=None, n_estimators=None,
    num_boost_round=None
)
Parameters of CatBoost Ranker Class
Here is a table showing the parameters used for the CatBoostRanker class with their descriptions −
Parameter | Description |
---|---|
iterations | Number of boosting iterations (trees). |
learning_rate | Step size shrinkage used in updating weights. |
depth | Depth of each tree. Affects the complexity and performance of the model. |
l2_leaf_reg | L2 regularization term on weights to avoid overfitting. |
model_size_reg | Model size regularization parameter to control the size of the model. |
rsm | Random Subspace Method: Fraction of features to consider at each split. |
loss_function | Loss function for ranking. Defaults to YetiRank, designed for ranking tasks. |
border_count | Number of splits for numeric features. |
feature_border_type | Method to select borders for numeric features. |
per_float_feature_quantization | Allows setting quantization parameters per floating feature. |
input_borders | Predefined borders for features. |
output_borders | Specify borders for model output. |
fold_permutation_block | Block size for random permutations in folds. |
od_pval | p-value for overfitting detection. Stops training when the model's improvement is statistically insignificant. |
od_wait | Number of iterations to wait before stopping if overfitting is detected. |
od_type | Type of overfitting detection to use (IncToDec, Iter). |
nan_mode | How to handle missing values (Min, Max, or Forbidden). |
counter_calc_method | Method to calculate counters for categorical features (Full, SkipTest). |
leaf_estimation_iterations | Number of iterations for leaf estimation in gradient boosting. |
leaf_estimation_method | Method to estimate leaf values (Newton, Gradient). |
thread_count | Number of threads to use during training. |
random_seed | Random seed to ensure reproducibility. |
use_best_model | If True, the final model is shrunk to the iteration that achieved the best evaluation metric on the validation set. |
best_model_min_trees | Minimum number of trees required to calculate the best model. |
verbose | Verbosity level for logging. |
silent | If True, suppresses output. |
logging_level | Logging level (Silent, Verbose, Info, Debug). |
metric_period | Frequency for calculating metrics. |
ctr_leaf_count_limit | Limit on the number of leaves for categorical feature combinations. |
store_all_simple_ctr | Store all values for simple CTRs (if True). |
max_ctr_complexity | Maximum complexity for categorical feature combinations. |
has_time | If True, the input data is treated as ordered in time and random permutations are not performed. |
allow_const_label | Whether to allow training with constant label values. |
target_border | Target border for binary classification. |
one_hot_max_size | Maximum number of unique values in a categorical feature to apply one-hot encoding. |
random_strength | Amount of random noise to add to scoring function at each split. |
name | Name of the model. |
ignored_features | Features to be ignored during training. |
train_dir | Directory to store training logs and snapshots. |
custom_metric | Custom metrics to use. |
eval_metric | Evaluation metric used to assess model quality. |
bagging_temperature | Temperature of the Bayesian bagging. Affects randomness. |
save_snapshot | If True, saves snapshots of the model during training. |
snapshot_file | File name for saving snapshots. |
snapshot_interval | Interval to save model snapshots. |
fold_len_multiplier | Length of folds for data splitting. |
used_ram_limit | Maximum allowed RAM usage during training. |
gpu_ram_part | Fraction of GPU RAM to use for training. |
pinned_memory_size | Size of pinned memory for GPU training. |
allow_writing_files | If False, disables file writing during training. |
final_ctr_computation_mode | Mode for final CTR (categorical target statistics) computation (Default, Skip, etc.). |
approx_on_full_history | If True, uses full dataset history for approximations. |
boosting_type | Type of boosting algorithm (Ordered, Plain). |
simple_ctr | Defines simple CTRs to compute during training. |
combinations_ctr | Defines combinations CTRs (based on multiple features). |
per_feature_ctr | Custom CTRs for individual features. |
ctr_description | Description of CTRs to compute. |
ctr_target_border_count | Number of borders to use for target binarization in CTR. |
task_type | Device to run the training on (CPU, GPU). |
device_config | Configuration of devices for training. |
devices | List of GPU devices to use. |
bootstrap_type | Type of bootstrap sampling (Bayesian, Bernoulli, Poisson). |
subsample | Fraction of the dataset to sample for each tree. |
mvs_reg | Variance regularization term for MVS bootstrap. |
sampling_frequency | Frequency of sampling (PerTree, PerTreeLevel). |
sampling_unit | Sampling unit (Object, Group). |
dev_score_calc_obj_block_size | Block size for scoring calculation. |
dev_efb_max_buckets | Maximum number of bins for exclusive feature bundling. |
sparse_features_conflict_fraction | Fraction of feature conflicts to allow for sparse features. |
max_depth | Maximum depth of the trees. |
n_estimators | Number of boosting trees. |
num_boost_round | Number of boosting rounds. |
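As a quick illustration, here is a hedged sketch of how a few of these parameters are commonly combined; the specific values are arbitrary and only meant to show the call shape −
from catboost import CatBoostRanker

# Arbitrary example values; tune them for your own data.
model = CatBoostRanker(
    iterations=2000,            # upper bound on the number of trees
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,
    loss_function='YetiRank',   # default ranking objective
    eval_metric='NDCG',         # metric reported on the eval_set
    od_type='Iter',             # overfitting detector: stop after...
    od_wait=50,                 # ...50 iterations without improvement
    use_best_model=True,        # keep the best iteration on the eval set
    random_seed=42,
    verbose=100,
)
# model.fit(train_data, eval_set=test_data) would then train with these settings.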
Ranking Metrics in CatBoost
Ranking metrics typically focus on the model's performance at the top of the retrieved results (for example, the top 10). You can tell CatBoost how many of the highest positions (k) to consider when defining a metric. In a ranking task, the contents of a list are ordered by their relevance to a given query. CatBoost provides a range of ranking modes and metrics for analyzing and improving ranking systems. Below are some of CatBoost's primary ranking modes, followed by a short usage sketch −
YetiRank
PairLogit
QuerySoftmax
QueryRMSE
YetiRankPairwise
PairLogitPairwise
These modes can be used on both CPU and GPU; the pairwise variants, such as YetiRankPairwise and PairLogitPairwise, are intended for more difficult ranking jobs.
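As a hedged illustration (the parameter values are arbitrary), a mode from the list above is selected through loss_function, and a cut-off k can be attached to the evaluation metric using the top parameter −
from catboost import CatBoostRanker

# Arbitrary illustration: QueryRMSE as the training objective,
# with NDCG evaluated only on the top 10 positions of each group.
model = CatBoostRanker(
    iterations=500,
    loss_function='QueryRMSE',
    eval_metric='NDCG:top=10',
)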
Important CatBoost Ranking Metrics
Here is a discussion of some of the most popular CatBoost ranking metrics −
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a popular metric for evaluating ranking methods. It measures the quality of a ranking by comparing the predicted order of the items with the ideal order. NDCG scores range from 0 to 1, where 1 represents a perfect ranking. Here are the parameters of NDCG, with a worked sketch after the list:
Top samples: The number of top samples used in a group to calculate the ranking measure.
Metric calculation principles: Base and Exp are both possible.
Metric denominator type: You can use either Position or LogPosition as the type of metric denominator.
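These parameters map onto the usual NDCG formula. The following hedged sketch (with made-up relevance labels) computes NDCG@k by hand using the Exp gain type and LogPosition denominator, and shows how roughly the same choices would be written as a CatBoost metric string −
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain and log2 position discount."""
    relevances = np.asarray(relevances, dtype=float)[:k]
    gains = 2 ** relevances - 1                              # 'Exp' gain type
    discounts = np.log2(np.arange(2, len(relevances) + 2))   # 'LogPosition' denominator
    return np.sum(gains / discounts)

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)  # ideal ordering of the same items
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up relevance labels, in the order the model ranked the items.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))

# Roughly the same configuration as a CatBoost metric string:
# eval_metric='NDCG:top=5;type=Exp;denominator=LogPosition'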
Example of NDCG
Here is an example of Normalized Discounted Cumulative Gain (NDCG) −
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostRanker, Pool
from sklearn.preprocessing import LabelEncoder

# Load the dataset
data = pd.read_csv('house_rent_data.csv')

# Preprocess the data
# Convert categorical features to numerical codes
categorical_features = ['Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact']
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Example target and features
target = 'Rent'
features = ['BHK', 'Size', 'Floor', 'Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Bathroom']

# Create a 'group' column for ranking (e.g., using 'City' or 'Area Locality' as a group ID)
data['group_id'] = data['Area Locality']

# Split the data into features (X) and target (y)
X = data[features]
y = data[target]
groups = data['group_id']

# Split into train and test sets
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(X, y, groups, test_size=0.2, random_state=42)

# CatBoost requires rows of the same group to be contiguous, so sort each split by group id
train_order = group_train.sort_values().index
test_order = group_test.sort_values().index
X_train, y_train, group_train = X_train.loc[train_order], y_train.loc[train_order], group_train.loc[train_order]
X_test, y_test, group_test = X_test.loc[test_order], y_test.loc[test_order], group_test.loc[test_order]

# Create Pool objects
train_data = Pool(data=X_train, label=y_train, group_id=group_train)
test_data = Pool(data=X_test, label=y_test, group_id=group_test)

# Initialize and train the model
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='NDCG',
    verbose=100
)

# Train the model
model.fit(train_data, eval_set=test_data)

# Optionally, make predictions
preds = model.predict(test_data)

# Print predictions
print(preds)
Output
Here is the output result of the above model −
0:   learn: 0.5076213  test: 0.4956289  best: 0.4956289 (0)    total: 43ms   remaining: 43s
100: learn: 0.6921379  test: 0.6782139  best: 0.6782139 (100)  total: 4.3s   remaining: 39s
200: learn: 0.7451684  test: 0.7119232  best: 0.7119232 (200)  total: 8.3s   remaining: 33s
...
800: learn: 0.8345261  test: 0.7547821  best: 0.7547821 (800)  total: 34.8s  remaining: 8.6s
900: learn: 0.8415623  test: 0.7585234  best: 0.7585234 (900)  total: 39.3s  remaining: 4.2s
999: learn: 0.8471923  test: 0.7615421  best: 0.7615421 (999)  total: 43.8s  remaining: 0s
Mean Reciprocal Rank (MRR)
MRR is another metric used to evaluate a ranking model's effectiveness. For each query it takes the reciprocal of the rank of the first relevant item in the list, and these values are averaged over all queries. Its parameter can be Base or Exp. A small hand-computed sketch is shown below, followed by the CatBoost usage.
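As a hedged illustration with made-up relevance labels, MRR can be computed by hand like this (here an item is treated as relevant if its label is greater than zero) −
import numpy as np

def mean_reciprocal_rank(ranked_relevances):
    """ranked_relevances: one list of relevance labels per query, in ranked order."""
    reciprocal_ranks = []
    for labels in ranked_relevances:
        relevant_positions = np.nonzero(np.asarray(labels) > 0)[0]
        # Reciprocal of the 1-based rank of the first relevant item, or 0 if none exists.
        reciprocal_ranks.append(1.0 / (relevant_positions[0] + 1) if len(relevant_positions) else 0.0)
    return float(np.mean(reciprocal_ranks))

# Two made-up queries: the first relevant item appears at rank 2 and rank 1 respectively.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 1]]))   # (1/2 + 1/1) / 2 = 0.75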
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='MRR'
)
model.fit(train_data, eval_set=test_data)
Output
Here is the result of the above code −
0:   learn: 0.2543210  test: 0.2504560  best: 0.2504560 (0)    total: 45ms   remaining: 45s
100: learn: 0.3567342  test: 0.3408171  best: 0.3408171 (100)  total: 4.4s   remaining: 39s
200: learn: 0.4021785  test: 0.3674215  best: 0.3674215 (200)  total: 8.6s   remaining: 34s
...
800: learn: 0.5276723  test: 0.4384712  best: 0.4384712 (800)  total: 34.3s  remaining: 8.5s
900: learn: 0.5412312  test: 0.4457893  best: 0.4457893 (900)  total: 38.6s  remaining: 4.2s
999: learn: 0.5523456  test: 0.4501234  best: 0.4501234 (999)  total: 43.0s  remaining: 0s
These values show how the MRR metric on the training and test sets improves over the boosting iterations; a higher score indicates that relevant items are being placed closer to the top of the ranking.
Expected Reciprocal Rank (ERR)
ERR estimates the probability that a user stops at a given rank because the result there satisfied the query, which makes it useful in cases where user satisfaction is important. Its parameter is the probability of search continuation, and its default value is 0.85. A hand-computed sketch is shown below, followed by the CatBoost usage.
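As a hedged illustration with made-up graded relevance labels, the classic textbook form of ERR can be computed roughly as follows (the stop probability at each rank is derived from the item's relevance grade; CatBoost's parameterized version with a continuation probability may differ slightly) −
import numpy as np

def expected_reciprocal_rank(relevances, max_grade=3):
    """relevances: graded relevance labels in ranked order (0 .. max_grade)."""
    relevances = np.asarray(relevances, dtype=float)
    stop_probs = (2 ** relevances - 1) / (2 ** max_grade)  # chance the user is satisfied at each rank
    err, prob_reach = 0.0, 1.0
    for rank, p_stop in enumerate(stop_probs, start=1):
        err += prob_reach * p_stop / rank   # contribution of stopping exactly at this rank
        prob_reach *= (1.0 - p_stop)        # probability the user continues past this rank
    return err

# Made-up ranked relevance grades for a single query.
print(expected_reciprocal_rank([3, 1, 0, 2]))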
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='ERR'
)
model.fit(train_data, eval_set=test_data)
Output
Here is the result −
0:   learn: 0.3000000  test: 0.2500000  best: 0.2500000 (0)    total: 0.1s    remaining: 1m 40s
1:   learn: 0.3200000  test: 0.2600000  best: 0.2600000 (1)    total: 0.2s    remaining: 1m 40s
...
900: learn: 0.5412312  test: 0.4457893  best: 0.4457893 (900)  total: 38.6s   remaining: 4.2s
999: learn: 0.9000000  test: 0.8500000  best: 0.8500000 (999)  total: 1m 40s  remaining: 0us
Mean Average Precision (MAP)
The MAP metric calculates the mean of the average precision scores across all queries. It is most useful for applications where relevance is binary (an item is either relevant or not). A hand-computed sketch is shown below, followed by the CatBoost usage.
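As a hedged illustration with made-up binary relevance labels, average precision for a single query and MAP over several queries can be computed by hand like this −
import numpy as np

def average_precision(labels):
    """labels: binary relevance labels in ranked order for a single query."""
    labels = np.asarray(labels, dtype=float)
    if labels.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(labels) + 1)
    precision_at_k = np.cumsum(labels) / ranks                     # precision at each position
    return float(np.sum(precision_at_k * labels) / labels.sum())   # averaged over relevant positions

def mean_average_precision(queries):
    return float(np.mean([average_precision(q) for q in queries]))

# Two made-up queries with binary relevance labels in ranked order.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))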
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='MAP'
)
model.fit(train_data, eval_set=test_data)
Output
Here is the outcome −
0:   learn: 0.2756481  test: 0.2704345  best: 0.2704345 (0)    total: 46ms   remaining: 46s
100: learn: 0.3892345  test: 0.3753421  best: 0.3753421 (100)  total: 4.5s   remaining: 39s
200: learn: 0.4312456  test: 0.3981279  best: 0.3981279 (200)  total: 8.8s   remaining: 34s
...
800: learn: 0.5478923  test: 0.4705678  best: 0.4705678 (800)  total: 34.9s  remaining: 8.6s
900: learn: 0.5589781  test: 0.4752345  best: 0.4752345 (900)  total: 39.2s  remaining: 4.3s
999: learn: 0.5701234  test: 0.4803456  best: 0.4803456 (999)  total: 43.5s  remaining: 0s