LightGBM - Ranking



Ranking means placing items in a specified order, such as sorting students by grade or ordering search results by relevance. In machine learning, ranking is used to order items by their value or relevance.

LightGBM can be used for ranking tasks, which require arranging data in an ordered manner. This is helpful in a number of situations, such as −

  • Search Engines − When you search for a query on Google, the results are ordered by their relevance to the query you entered.

  • Recommender Systems − When you watch YouTube videos or shop online, the system ranks the options and suggests the ones that are most relevant to you.

Ranking Loss Functions in LightGBM

When LightGBM is used for ranking, it tries to put the items in the best possible order. To do this, LightGBM uses a "loss function." A loss function measures how far the model's output is from the desired result. If the ranking is correct, the loss is small; otherwise, the loss is large. The objective is to minimize the loss function by ranking as accurately as possible.

Here are some ranking loss functions and metrics we can use in LightGBM −

LambdaRank

This loss function tries to improve the relevance of search results and recommendations. It treats ranking as a pairwise problem: the algorithm looks at one pair of items at a time and learns which of the two should be placed higher, and the final order of the complete list follows from the resulting scores. LambdaRank also weights each pair by how much swapping the two items would change a ranking metric such as NDCG, which is why it is popular: it targets ranking quality directly.
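The pairwise idea can be illustrated with a minimal sketch. The function pairwise_logistic_loss below is a hypothetical helper, not part of LightGBM. It averages a logistic loss over all pairs in one query where the first item is more relevant than the second. It leaves out the metric-based pair weighting that LambdaRank adds, so treat it as an illustration of the pairwise view rather than LightGBM's implementation.

```python
import math

def pairwise_logistic_loss(scores, labels):
    """Average logistic loss over pairs (i, j) with labels[i] > labels[j]."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                # The loss is small when the more relevant item has the higher score
                diff = scores[i] - scores[j]
                losses.append(math.log1p(math.exp(-diff)))
    return sum(losses) / len(losses) if losses else 0.0

labels = [2, 1, 0]             # relevance of three items in one query
good_scores = [3.0, 2.0, 1.0]  # scores that agree with the labels
bad_scores = [1.0, 2.0, 3.0]   # scores that reverse the order

loss_good = pairwise_logistic_loss(good_scores, labels)
loss_bad = pairwise_logistic_loss(bad_scores, labels)
print(loss_good < loss_bad)  # prints True
```

Scores that order the pairs correctly give a lower loss than scores that reverse them, which is the signal a pairwise objective learns from.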

NDCG (Normalized Discounted Cumulative Gain)

NDCG is a metric that measures the quality of a ranked list. It gives more weight to items near the top of the list, as they are the most important. In LightGBM, NDCG is typically set as the evaluation metric, while a ranking objective such as LambdaRank is designed to improve it. The goal is to maximize the NDCG score by placing the most relevant items at the top. This is useful for search engines and recommendation systems that depend largely on the first few results.
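To make the metric concrete, here is a small sketch of NDCG for a single ranked list. It assumes one common definition, with a gain of 2^relevance − 1 and a log2 position discount; other implementations may use the relevance value itself as the gain, so their numbers can differ. The function names dcg and ndcg are chosen here for illustration.

```python
import math

def dcg(relevances, k=None):
    """DCG of relevance labels listed in ranked order (top item first)."""
    top = relevances[:k] if k is not None else relevances
    # Position 0 gets discount log2(2) = 1, so lower positions count less
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(top))

def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

ranked = [3, 2, 3, 0, 1]  # relevance labels in the order the model ranked them
print(ndcg(ranked))       # below 1 because the order is not ideal
print(ndcg(sorted(ranked, reverse=True)))  # the ideal order scores 1.0
```

The k argument corresponds to evaluating NDCG only on the first k positions, as in ndcg@1 or ndcg@3.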

MAP (Mean Average Precision)

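Mean average precision measures how well a model ranks the relevant items across a set of queries. It builds on precision, which, together with recall, is a commonly used measure for evaluating a classification model. For each query, the precision is computed at the position of every relevant item and these values are averaged; the mean of the averages over all queries is the MAP. It is useful for making sure that a large number of relevant items appear near the top.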
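The calculation can be sketched in a few lines. The example below assumes binary relevance (1 for relevant, 0 for not relevant) and that every relevant item of a query appears in its ranked list; the function names are illustrative only.

```python
def average_precision(relevant_flags):
    """Average of precision@k taken at each position k that holds a relevant item."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two queries, each given as relevance flags in ranked order
queries = [[1, 0, 1, 0], [0, 1]]
map_score = mean_average_precision(queries)
print(map_score)
```

In the first query the relevant items sit at positions 1 and 3, so the precisions are 1/1 and 2/3; in the second query the only relevant item is at position 2, giving 1/2.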

List-wise Loss

Instead of depending only on pairs, list-wise loss functions evaluate the entire set of ranked items. This technique evaluates the overall quality of the ranking list and tries to improve it. LightGBM can use list-wise loss functions to find the best ranking order for all items in a group at once.
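As an illustration of the list-wise view, the sketch below compares the whole list of scores with the whole list of labels using a softmax cross-entropy, in the style of the ListNet top-one loss. This is a swapped-in textbook example chosen for clarity; it is not LightGBM's internal implementation, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Turn a list of numbers into a probability distribution."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_cross_entropy(scores, labels):
    """Cross-entropy between the label distribution and the score distribution."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

labels = [3, 1, 0]  # relevance of all items in one group
loss_good = listwise_cross_entropy([2.0, 1.0, 0.0], labels)
loss_bad = listwise_cross_entropy([0.0, 1.0, 2.0], labels)
print(loss_good < loss_bad)  # prints True
```

Unlike a pairwise loss, every item's score affects the loss of every other item in the group, because the softmax normalizes over the full list.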

Example using LightGBM for Ranking

Here is a Python example that shows how to use LightGBM for ranking. We will create a small dataset, train a LightGBM ranking model on it, and then use the model to predict the ranking order.

  • Step 1 − First, import the necessary libraries − lightgbm, numpy, sklearn.model_selection, and sklearn.metrics.

    import lightgbm as lgb
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import ndcg_score
    
  • Step 2 − Create a sample dataset with X, y, and group. Here X is the feature matrix, which has seven samples with two features each. y holds the relevance scores, where a higher value means the item is more relevant. group lists how many consecutive samples belong to each query group (2, 3, and 2).

    X = np.array([[0.2, 1], [0.4, 2], [0.3, 1], [0.6, 2], [0.8, 3], [0.5, 2], [0.9, 3]])
    y = np.array([1, 2, 2, 3, 4, 3, 5])
    group = [2, 3, 2]
    
  • Step 3 − Expand the group array into a group index for each sample in X. The group_indices array is created by repeating each group's index once for every item in that group.

    group_indices = np.repeat(range(len(group)), group)
    
  • Step 4 − The train_test_split method splits the dataset into training and testing sets. X, y, and group_indices are split together in a 70:30 ratio. Note that this split is made per sample, so items of one group can end up in both sets and may no longer be stored next to each other.

    X_train, X_test, y_train, y_test, group_train_indices, group_test_indices = train_test_split(
       X, y, group_indices, test_size=0.3, random_state=42
    )
    
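  • Step 5 − Count how many samples of each group are in the training and testing datasets. The group_train and group_test lists give the number of samples in each training and testing group. LightGBM reads these sizes as consecutive blocks of rows, so the rows of each group are expected to be contiguous.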

    group_train = [np.sum(group_train_indices == i) for i in np.unique(group_train_indices)]
    group_test = [np.sum(group_test_indices == i) for i in np.unique(group_test_indices)]
    
  • Step 6 − Now generate the LightGBM datasets for training and testing. The group parameter gives the number of samples per group, which is required for ranking tasks.

    train_data = lgb.Dataset(X_train, label=y_train, group=group_train)
    test_data = lgb.Dataset(X_test, label=y_test, group=group_test)
    
  • Step 7 − Then define the parameters for the LightGBM model. The ndcg_at setting lists the positions at which NDCG is evaluated −

    params = {
       'objective': 'lambdarank',
       'metric': 'ndcg',
       'learning_rate': 0.1,
       'num_leaves': 31,
       'min_data_in_leaf': 1,
       'ndcg_at': [1, 3, 5],
       'verbose': -1
    }
    
  • Step 8 − Then train the LightGBM model on the training data, using the test data as the validation set for early stopping.

    gbm = lgb.train(
       params,
       train_data,
       valid_sets=[test_data],
       num_boost_round=100,
       callbacks=[lgb.early_stopping(stopping_rounds=10)]
    )
    
  • Step 9 − Use the trained model to predict scores for the test data. We use ndcg_score to measure the performance of the model. Here the whole test set is passed as a single list, so it is scored as one query.

    y_pred = gbm.predict(X_test)
    score = ndcg_score([y_test], [y_pred])
    print(f"NDCG Score: {score}")
    
  • Step 10 − Here is the output, which shows how well the model ranks the test data.

    Training until validation scores don't improve for 10 rounds
    Early stopping, best iteration is:
    [1]	valid_0's ndcg@1: 0.666667	valid_0's ndcg@3: 0.898354	valid_0's ndcg@5: 0.898354
    NDCG Score: 0.894999002123018
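
Because LightGBM reads group sizes as consecutive blocks of rows, one alternative to a per-sample split is to hold out whole groups. The sketch below reuses X, y, and group from Step 2. Picking the last group as the test set is an arbitrary choice for illustration, and group_sizes is a hypothetical helper. It only prepares the arrays and group sizes, and it does not reproduce the output shown above.

```python
import numpy as np

X = np.array([[0.2, 1], [0.4, 2], [0.3, 1], [0.6, 2], [0.8, 3], [0.5, 2], [0.9, 3]])
y = np.array([1, 2, 2, 3, 4, 3, 5])
group = [2, 3, 2]

# One group id per row: [0, 0, 1, 1, 1, 2, 2]
group_ids = np.repeat(np.arange(len(group)), group)

# Assumption for illustration: hold out the last group as the test set
test_groups = np.array([2])
test_mask = np.isin(group_ids, test_groups)

# Boolean masks keep the original row order, so each group stays contiguous
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

def group_sizes(ids):
    """Number of rows per group, in ascending group-id order."""
    _, counts = np.unique(ids, return_counts=True)
    return counts.tolist()

group_train = group_sizes(group_ids[~test_mask])
group_test = group_sizes(group_ids[test_mask])
print(group_train, group_test)
```

The resulting group_train and group_test lists can be passed to the group parameter of lgb.Dataset in the same way as in Step 6.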
    

Advantages of Using LightGBM for Ranking

Here are some reasons why LightGBM is a great choice for ranking tasks −

  • Speed − LightGBM is very fast. It can handle large amounts of data, which is essential when there are a large number of items to evaluate, like thousands of products or millions of web pages. Because of its speed you get results sooner, which is important for companies that need to make quick decisions.

  • Memory Efficient − LightGBM uses histogram-based training, which groups continuous feature values into a limited number of bins and keeps memory usage low. This means that LightGBM can run on computers with less powerful hardware, which makes it suitable for a wide range of devices, from laptops to large servers.

  • Accuracy − LightGBM can produce precise predictions. It learns an effective way to rank items so that the final order is both accurate and useful. This level of accuracy improves the user experience by showing the most relevant products in an online store or the top results on a search engine.

  • Handles Missing Data − Sometimes your data is incomplete or contains missing values. LightGBM can handle missing data effectively without major cleanup. Even if some information is missing, the model can continue to learn from the data and make accurate ranking predictions.
