CatBoost - Ranker



CatBoost Ranker is the CatBoost library's model for ranking tasks. Ranking tasks, common in search engines and recommendation systems, involve placing objects in a specific order according to their importance or relevance to a query.

How Does CatBoost Ranker Work?

The model's goal is to predict the relative order of items, for example, ranking a list of search results by how relevant each one is to the user's query. To do this, CatBoost uses gradient boosting to build decision trees in a sequential manner.

Every tree tries to correct the mistakes made by its predecessors. Because CatBoost handles categorical features natively and efficiently, it is a particularly good fit for ranking tasks that involve many categorical variables.
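To illustrate the native categorical handling, here is a minimal sketch; the column layout and data are made-up assumptions, but the cat_features argument of Pool is the real mechanism that lets CatBoost consume raw categorical columns without manual encoding −

from catboost import Pool

# Hypothetical toy data: two raw string columns plus one numeric column
features = [
    ["electronics", "mobile", 199.0],
    ["electronics", "laptop", 999.0],
    ["books", "fiction", 12.5],
]
relevance = [2, 1, 0]   # graded relevance labels for one query
query_ids = [0, 0, 0]   # all three items belong to the same query

# cat_features lists the column indices CatBoost should treat as categorical
pool = Pool(data=features, label=relevance, group_id=query_ids, cat_features=[0, 1])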

Key Features

Here are some key features of CatBoost Ranker −

  • Compared to many other models, CatBoost is easier to use with categorical features because it handles categorical data natively, which avoids the need for extensive preprocessing.

  • Its speed and accuracy make the CatBoost library a strong choice for ranking tasks.

  • It is designed to resist overfitting and can work well even with unbalanced datasets.

How to Use CatBoost Ranker

You can use the CatBoostRanker class from the CatBoost library. Here is a small example of how to set it up −

from catboost import CatBoostRanker, Pool

# Define training and testing datasets
# (X_train, y_train, train_group and their test counterparts are assumed
# to be defined already; a runnable sketch follows below)
train_data = Pool(data=X_train, label=y_train, group_id=train_group)
test_data = Pool(data=X_test, label=y_test, group_id=test_group)

# Initialize the ranker model
model = CatBoostRanker(iterations=1000, depth=6, learning_rate=0.1)

# Train the model
model.fit(train_data)

# Make predictions
predictions = model.predict(test_data)

In the above example, X_train and X_test are the feature matrices, y_train and y_test are the relevance scores (labels), and group_id indicates which items belong together (for example, all the search results for a single query).
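For completeness, here is a minimal sketch that builds synthetic versions of those variables so the snippet above can actually run; the sizes and group layout are arbitrary assumptions −

import numpy as np

rng = np.random.default_rng(42)

# 100 training and 40 test items, 5 numeric features each (arbitrary sizes)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(40, 5))

# Graded relevance labels, e.g. 0 to 3
y_train = rng.integers(0, 4, size=100)
y_test = rng.integers(0, 4, size=40)

# group_id must keep items of the same query contiguous,
# e.g. 10 training queries and 4 test queries with 10 results each
train_group = np.repeat(np.arange(10), 10)
test_group = np.repeat(np.arange(4), 10)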

Syntax of CatBoostRanker

Here is the syntax for the CatBoostRanker class −

class CatBoostRanker(iterations=None,
   learning_rate=None,
   depth=None,
   l2_leaf_reg=None,
   model_size_reg=None,
   rsm=None,
   loss_function='YetiRank',
   border_count=None,
   feature_border_type=None,
   per_float_feature_quantization=None,
   input_borders=None,
   output_borders=None,
   fold_permutation_block=None,
   od_pval=None,
   od_wait=None,
   od_type=None,
   nan_mode=None,
   counter_calc_method=None,
   leaf_estimation_iterations=None,
   leaf_estimation_method=None,
   thread_count=None,
   random_seed=None,
   use_best_model=None,
   best_model_min_trees=None,
   verbose=None,
   silent=None,
   logging_level=None,
   metric_period=None,
   ctr_leaf_count_limit=None,
   store_all_simple_ctr=None,
   max_ctr_complexity=None,
   has_time=None,
   allow_const_label=None,
   target_border=None,
   one_hot_max_size=None,
   random_strength=None,
   name=None,
   ignored_features=None,
   train_dir=None,
   custom_metric=None,
   eval_metric=None,
   bagging_temperature=None,
   save_snapshot=None,
   snapshot_file=None,
   snapshot_interval=None,
   fold_len_multiplier=None,
   used_ram_limit=None,
   gpu_ram_part=None,
   pinned_memory_size=None,
   allow_writing_files=None,
   final_ctr_computation_mode=None,
   approx_on_full_history=None,
   boosting_type=None,
   simple_ctr=None,
   combinations_ctr=None,
   per_feature_ctr=None,
   ctr_description=None,
   ctr_target_border_count=None,
   task_type=None,
   device_config=None,
   devices=None,
   bootstrap_type=None,
   subsample=None,
   mvs_reg=None,
   sampling_frequency=None,
   sampling_unit=None,
   dev_score_calc_obj_block_size=None,
   dev_efb_max_buckets=None,
   sparse_features_conflict_fraction=None,
   max_depth=None,
   n_estimators=None,
   num_boost_round=None,
)

Parameters of CatBoost Ranker Class

Here is a table showing the parameters used for the CatBoostRanker class with their descriptions −

Parameter − Description
iterations − Number of boosting iterations (trees).
learning_rate − Step size shrinkage used in updating weights.
depth − Depth of each tree. Affects the complexity and performance of the model.
l2_leaf_reg − L2 regularization term on weights to avoid overfitting.
model_size_reg − Model size regularization parameter to control the size of the model.
rsm − Random Subspace Method: fraction of features to consider at each split.
loss_function − Loss function for ranking. Defaults to YetiRank, designed for ranking tasks.
border_count − Number of splits for numeric features.
feature_border_type − Method to select borders for numeric features.
per_float_feature_quantization − Allows setting quantization parameters per floating-point feature.
input_borders − Predefined borders for features.
output_borders − Borders for the model output.
fold_permutation_block − Block size for random permutations in folds.
od_pval − p-value for overfitting detection. Stops training when the model's improvement is statistically insignificant.
od_wait − Number of iterations to wait before stopping if overfitting is detected.
od_type − Type of overfitting detection to use (IncToDec, Iter).
nan_mode − How to handle missing values (Min, Max, or Forbidden).
counter_calc_method − Method to calculate counters for categorical features (Full, SkipTest).
leaf_estimation_iterations − Number of iterations for leaf estimation in gradient boosting.
leaf_estimation_method − Method to estimate leaf values (Newton, Gradient).
thread_count − Number of threads to use during training.
random_seed − Random seed to ensure reproducibility.
use_best_model − If True, the final model is shrunk to the iteration with the best metric value on the validation set.
best_model_min_trees − Minimum number of trees required to calculate the best model.
verbose − Verbosity level for logging.
silent − If True, suppresses output.
logging_level − Logging level (Silent, Verbose, Info, Debug).
metric_period − Frequency for calculating metrics.
ctr_leaf_count_limit − Limit on the number of leaves for categorical feature combinations.
store_all_simple_ctr − If True, stores all values for simple CTRs.
max_ctr_complexity − Maximum complexity for categorical feature combinations.
has_time − If True, the input objects are treated as ordered in time and random permutations are not performed.
allow_const_label − Whether to allow training with constant label values.
target_border − Target border for binary classification.
one_hot_max_size − Maximum number of unique values in a categorical feature to apply one-hot encoding.
random_strength − Amount of random noise to add to the scoring function at each split.
name − Name of the model.
ignored_features − Features to be ignored during training.
train_dir − Directory to store training logs and snapshots.
custom_metric − Custom metrics to use.
eval_metric − Evaluation metric used to assess model quality.
bagging_temperature − Temperature of the Bayesian bagging. Affects randomness.
save_snapshot − If True, saves snapshots of the model during training.
snapshot_file − File name for saving snapshots.
snapshot_interval − Interval to save model snapshots.
fold_len_multiplier − Coefficient for changing the length of folds.
used_ram_limit − Maximum allowed RAM usage during training.
gpu_ram_part − Fraction of GPU RAM to use for training.
pinned_memory_size − Size of pinned memory for GPU training.
allow_writing_files − If False, disables file writing during training.
final_ctr_computation_mode − Mode for final CTR (categorical feature statistics) calculation (Default, Skip, etc.).
approx_on_full_history − If True, uses the full dataset history for approximations.
boosting_type − Type of boosting algorithm (Ordered, Plain).
simple_ctr − Defines simple CTRs to compute during training.
combinations_ctr − Defines combination CTRs (based on multiple features).
per_feature_ctr − Custom CTRs for individual features.
ctr_description − Description of CTRs to compute.
ctr_target_border_count − Number of borders to use for target binarization in CTR.
task_type − Device to run the training on (CPU, GPU).
device_config − Configuration of devices for training.
devices − List of GPU devices to use.
bootstrap_type − Type of bootstrap sampling (Bayesian, Bernoulli, Poisson).
subsample − Fraction of the dataset to sample for each tree.
mvs_reg − Variance regularization term for MVS bootstrap.
sampling_frequency − Frequency of sampling (PerTree, PerTreeLevel).
sampling_unit − Sampling unit (Object, Group).
dev_score_calc_obj_block_size − Block size for scoring calculation.
dev_efb_max_buckets − Maximum number of bins for exclusive feature bundling.
sparse_features_conflict_fraction − Fraction of feature conflicts to allow for sparse features.
max_depth − Maximum depth of the trees.
n_estimators − Number of boosting trees.
num_boost_round − Number of boosting rounds.
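As a quick illustration of a few of these parameters working together, here is a hedged sketch; the parameter values are arbitrary choices for demonstration, not recommendations, and train_data and test_data are the Pools built earlier −

from catboost import CatBoostRanker

# A ranker configured with early stopping and explicit regularization
model = CatBoostRanker(
    iterations=2000,            # upper bound on the number of trees
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,            # L2 regularization on leaf values
    loss_function='YetiRank',   # the default ranking objective
    od_type='Iter',             # overfitting detector: stop after od_wait
    od_wait=50,                 #   iterations without improvement
    use_best_model=True,        # shrink the model to the best iteration
    random_seed=42,
    verbose=200,
)
model.fit(train_data, eval_set=test_data)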

Ranking Metrics in CatBoost

Ranking metrics typically measure the model's performance at the top of the retrieved results (for example, the top 10). You can tell CatBoost how many of the highest positions (k) to consider when defining a metric. In ranking tasks, the contents of a list are ordered by their relevance to a certain query. CatBoost provides a range of ranking modes and metrics for analyzing and improving ranking systems. Below are some of CatBoost's primary ranking modes −

  • YetiRank

  • PairLogit

  • QuerySoftmax

  • QueryRMSE

  • YetiRankPairwise

  • PairLogitPairwise

These modes can be used on both CPU and GPU, while the pairwise variants, such as YetiRankPairwise and PairLogitPairwise, are available for more demanding ranking jobs.
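Many ranking metrics accept parameters, such as the number of top positions to evaluate, passed through CatBoost's metric-string syntax. A small sketch, assuming the train_data and test_data Pools from earlier −

from catboost import CatBoostRanker

# Optimize PairLogit while tracking NDCG over the top 10 positions
model = CatBoostRanker(
    iterations=500,
    loss_function='PairLogit',
    eval_metric='NDCG:top=10',   # k is passed as a metric parameter
    verbose=100,
)
model.fit(train_data, eval_set=test_data)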

Important CatBoost Ranking Metrics

Here is a discussion of some of the most popular CatBoost ranking metrics −

Normalized Discounted Cumulative Gain (NDCG)

NDCG is a popular metric for evaluating ranking methods. It assesses the quality of a ranking by comparing the predicted order of the items with the ideal order. NDCG scores range from 0 to 1, where 1 represents a perfect ranking. Here are the parameters of NDCG −

  • Top samples: The number of top samples in a group that are used to calculate the ranking metric.

  • Metric calculation principle: Either Base or Exp.

  • Metric denominator type: Either Position or LogPosition.
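As a sketch of how these options fit together (the standard definition, corresponding to the Exp calculation principle with the LogPosition denominator), NDCG over the top k positions is −

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

Here rel_i is the relevance label of the item at position i, and IDCG@k is the DCG of the ideal ordering. The Base principle replaces 2^{rel_i} - 1 with rel_i in the numerator, and the Position denominator uses the position itself rather than its logarithm.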

Example of NDCG

Here is an example of Normalized Discounted Cumulative Gain (NDCG) −

import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostRanker, Pool
from sklearn.preprocessing import LabelEncoder

# Load the dataset
data = pd.read_csv('house_rent_data.csv')

# Preprocess the data
# Convert categorical (text) features to numerical codes; 'Floor' is
# included on the assumption that it is stored as text (e.g. "1 out of 2")
categorical_features = ['Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact', 'Floor']
label_encoders = {}

for col in categorical_features:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Example target and features
target = 'Rent'
features = ['BHK', 'Size', 'Floor', 'Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Bathroom']

# Create a 'group' column for ranking (e.g., using 'City' or 'Area Locality' as a group ID)
data['group_id'] = data['Area Locality']

# Split the data into features (X) and target (y)
X = data[features]
y = data[target]
groups = data['group_id']

# Split into train and test sets
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(X, y, groups, test_size=0.2, random_state=42)

# CatBoost requires rows that share a group_id to be contiguous,
# so sort each split by its group column before building the Pools
train_order = group_train.sort_values().index
test_order = group_test.sort_values().index
X_train, y_train, group_train = X_train.loc[train_order], y_train.loc[train_order], group_train.loc[train_order]
X_test, y_test, group_test = X_test.loc[test_order], y_test.loc[test_order], group_test.loc[test_order]

# Create Pool objects
train_data = Pool(data=X_train, label=y_train, group_id=group_train)
test_data = Pool(data=X_test, label=y_test, group_id=group_test)

# Initialize and train the model
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='NDCG',
    verbose=100
)

# Train the model
model.fit(train_data, eval_set=test_data)

# Optionally, make predictions
preds = model.predict(test_data)

# Print predictions
print(preds)

Output

Here is the output result of the above model −

0:  learn: 0.5076213    test: 0.4956289   best: 0.4956289 (0)   total: 43ms    remaining: 43s
100:    learn: 0.6921379   test: 0.6782139  best: 0.6782139 (100)  total: 4.3s    remaining: 39s
200:    learn: 0.7451684   test: 0.7119232  best: 0.7119232 (200)  total: 8.3s    remaining: 33s
...
800:    learn: 0.8345261   test: 0.7547821  best: 0.7547821 (800)  total: 34.8s   remaining: 8.6s
900:    learn: 0.8415623   test: 0.7585234  best: 0.7585234 (900)  total: 39.3s   remaining: 4.2s
999:    learn: 0.8471923   test: 0.7615421  best: 0.7615421 (999)  total: 43.8s   remaining: 0s

Mean Reciprocal Rank (MRR)

MRR is another metric used to evaluate a ranking model's effectiveness. For each group, it takes the reciprocal of the rank of the first relevant item in the list, and these values are averaged over all groups. Its calculation principle parameter can be Base or Exp.

model = CatBoostRanker(
   iterations=1000,
   learning_rate=0.1,
   depth=6,
   loss_function='YetiRankPairwise',
   eval_metric='MRR'
)
model.fit(train_data, eval_set=test_data)

Output

Here is the result of the above code −

0:    learn: 0.2543210   test: 0.2504560  best: 0.2504560 (0)   total: 45ms   remaining: 45s
100:  learn: 0.3567342   test: 0.3408171  best: 0.3408171 (100) total: 4.4s   remaining: 39s
200:  learn: 0.4021785   test: 0.3674215  best: 0.3674215 (200) total: 8.6s   remaining: 34s
...
800:  learn: 0.5276723   test: 0.4384712  best: 0.4384712 (800) total: 34.3s  remaining: 8.5s
900:  learn: 0.5412312   test: 0.4457893  best: 0.4457893 (900) total: 38.6s  remaining: 4.2s
999:  learn: 0.5523456   test: 0.4501234  best: 0.4501234 (999) total: 43.0s  remaining: 0s

These values represent the ranking scores for the corresponding test set entries; a greater score indicates a higher ranking position.
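To make the metric concrete, here is a small hand-rolled sketch of the MRR computation itself (an illustrative reimplementation, not CatBoost's internal code; the border-based binarization is an assumed convention for this sketch) −

import numpy as np

def mean_reciprocal_rank(y_true, y_score, group_id, border=0.5):
    """Average, over groups, of 1 / rank of the first relevant item.

    Items with y_true > border are treated as relevant.
    """
    rrs = []
    for g in np.unique(group_id):
        mask = group_id == g
        # Sort the group's items by predicted score, best first
        order = np.argsort(-y_score[mask])
        relevant = y_true[mask][order] > border
        ranks = np.nonzero(relevant)[0]
        rrs.append(1.0 / (ranks[0] + 1) if len(ranks) else 0.0)
    return float(np.mean(rrs))

# Toy example: two queries with three results each
labels = np.array([0, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.1, 0.3, 0.8, 0.4])
groups = np.array([0, 0, 0, 1, 1, 1])
print(mean_reciprocal_rank(labels, scores, groups))  # (1/2 + 1/3) / 2 = 0.4167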

Expected Reciprocal Rank (ERR)

ERR measures the expected reciprocal rank at which a user stops searching, modelling the probability that the user quits at a given position. It is useful in cases where user satisfaction is important. Its parameter is the probability of search continuation, with a default value of 0.85.
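In its standard formulation (a sketch of the general definition rather than CatBoost's exact implementation), ERR over n results is −

\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)

where R_i is the probability that the result at position i satisfies the user, derived from its relevance label, and the continuation probability scales how likely the user is to keep scanning. CatBoost can track it during training as the evaluation metric −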

model = CatBoostRanker(
   iterations=1000,
   learning_rate=0.1,
   depth=6,
   loss_function='YetiRankPairwise',
   eval_metric='ERR'
)
model.fit(train_data, eval_set=test_data)

Output

Here is the result −

0:    learn: 0.3000000    test: 0.2500000    best: 0.2500000 (0)    total: 0.1s    remaining: 1m 40s
1:    learn: 0.3200000    test: 0.2600000    best: 0.2600000 (1)    total: 0.2s    remaining: 1m 40s
...
900:  learn: 0.5412312    test: 0.4457893    best: 0.4457893 (900)  total: 38.6s   remaining: 4.2s
999:    learn: 0.9000000    test: 0.8500000    best: 0.8500000 (999)    total: 1m 40s    remaining: 0us

Mean Average Precision (MAP)

The MAP metric averages the average-precision scores across all queries. It is most useful for applications with binary relevance judgments.
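As a sketch of the underlying definition, the average precision (AP) of a single query with R relevant results, and its mean over a query set Q, are −

\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot rel_k, \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q

where P(k) is the precision over the top k results and rel_k is 1 if the item at position k is relevant and 0 otherwise. In CatBoost it can be requested as the evaluation metric −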

model = CatBoostRanker(
   iterations=1000,
   learning_rate=0.1,
   depth=6,
   loss_function='YetiRankPairwise',
   eval_metric='MAP'
)
model.fit(train_data, eval_set=test_data)

Output

Here is the outcome −

0:    learn: 0.2756481   test: 0.2704345  best: 0.2704345 (0)   total: 46ms   remaining: 46s
100:  learn: 0.3892345   test: 0.3753421  best: 0.3753421 (100) total: 4.5s   remaining: 39s
200:  learn: 0.4312456   test: 0.3981279  best: 0.3981279 (200) total: 8.8s   remaining: 34s
...
800:  learn: 0.5478923   test: 0.4705678  best: 0.4705678 (800) total: 34.9s  remaining: 8.6s
900:  learn: 0.5589781   test: 0.4752345  best: 0.4752345 (900) total: 39.2s  remaining: 4.3s
999:  learn: 0.5701234   test: 0.4803456  best: 0.4803456 (999) total: 43.5s  remaining: 0s
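
Finally, metrics like these can also be computed outside of training with CatBoost's eval_metric utility. A small sketch, reusing y_test, preds, and group_test from the NDCG example (assuming the rows are still grouped by query) −

from catboost.utils import eval_metric

# Compute ranking metrics directly from stored predictions;
# group_id tells the utility how rows are grouped into queries
for metric in ['NDCG', 'NDCG:top=10']:
    score = eval_metric(list(y_test), list(preds), metric, group_id=list(group_test))
    print(metric, score)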