XGBoost - Quantile Regression



XGBoost is typically used to predict a single central value, such as the mean of all possible outcomes. Sometimes, however, we want to understand the full range of possibilities, including the worst-case and best-case situations. This is where Quantile Regression comes in.

The idea is to train independent XGBoost models, each with its own quantile loss function. For example, you can train models for the 0.05, 0.5, and 0.95 quantiles, and then use the 0.05 and 0.95 predictions as the lower and upper bounds of a prediction interval.

With quantile regression, we can predict other points of the target distribution, or "quantiles," in addition to the mean (average). For example: the 10th percentile (a poor outcome), the 50th percentile (a typical outcome), and the 90th percentile (a good outcome).

How does Quantile Regression work in XGBoost?

By default, XGBoost minimizes a symmetric error, which pulls its predictions toward the average. When we combine XGBoost with Quantile Regression, we change the error measurement: instead of penalizing over- and under-predictions equally, we use the quantile (or "pinball") loss, which weights errors asymmetrically depending on the target quantile.
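To make this concrete, here is the pinball loss written out as a plain Python function. This is a sketch for illustration; pinball_loss is a hypothetical helper, not part of the XGBoost API −

import numpy as np

def pinball_loss(y_true, y_pred, quantile):
   # Under-predictions (positive error) are weighted by quantile,
   # over-predictions (negative error) by (1 - quantile)
   error = y_true - y_pred
   return np.mean(np.maximum(quantile * error, (quantile - 1) * error))

# For quantile=0.9, missing low costs 9x more than missing high
print(pinball_loss(np.array([1.0]), np.array([0.0]), 0.9))   # 0.9
print(pinball_loss(np.array([0.0]), np.array([1.0]), 0.9))   # 0.1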

In simple terms, using Quantile Regression with XGBoost −

  • Each model predicts the value for a given percentile rather than the mean.

  • By training several models, we can estimate a range of possible outcomes (bad, typical, and good) for each case.

This is useful, for example, in financial forecasting, where you need to plan for both best-case and worst-case scenarios.
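As a side note, recent versions of XGBoost (2.0 and later) also ship a built-in quantile objective, so a percentile model can be trained without writing a custom loss. A minimal sketch, assuming XGBoost >= 2.0 is installed −

import numpy as np
import xgboost as xgb

# "reg:quantileerror" is the built-in pinball loss; quantile_alpha picks the percentile
model_90 = xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=0.9)
model_90.fit(np.random.rand(100, 10), np.random.rand(100))
print(model_90.predict(np.random.rand(5, 10)))

The rest of this tutorial uses a custom loss function instead, which also works on older versions and shows how the gradients are computed.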

Quantile Regression with XGBoost

First, we import the libraries required to build quantile regression models with XGBoost and produce prediction intervals.

import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt

Next, we generate synthetic data: training features and targets, plus test features, all drawn from uniform random distributions.

# Generate synthetic data
np.random.seed(42)
X_train = np.random.rand(100, 10)
y_train = np.random.rand(100)
X_test = np.random.rand(20, 10)

A custom quantile loss function is then defined; it returns the gradient and Hessian that XGBoost's objective interface expects. Since the quantile loss is piecewise linear, its true second derivative is zero, so a constant Hessian of 1 is used as a practical stand-in. Three models are trained, one per quantile: 0.05, 0.5 (the median), and 0.95. These correspond to the lower bound, the median, and the upper bound of the prediction interval, respectively. After training, each model makes predictions on the test set.

def quantile_loss(quantile_value):
   def loss(true_values, predicted_values):
      error = true_values - predicted_values
      # Gradient of the pinball loss w.r.t. the prediction:
      # -q when we under-predict (error > 0), (1 - q) when we over-predict
      gradient = np.where(error > 0, -quantile_value, 1 - quantile_value)
      # The pinball loss is piecewise linear, so its true Hessian is zero;
      # a constant 1 is the usual practical stand-in
      hessian = np.ones_like(error)
      return gradient, hessian
   return loss
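As a quick sanity check with made-up inputs, the gradient for a high quantile should push predictions upward much harder than downward −

# error = [1.0, -1.0]: one under-prediction, one over-prediction
grad, hess = quantile_loss(0.95)(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(grad)   # [-0.95  0.05] - strong push up, weak push down
print(hess)   # [1. 1.]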

quantile_levels = [0.05, 0.5, 0.95]
regression_models = {}

for quantile in quantile_levels:
   regressor = xgb.XGBRegressor(objective=quantile_loss(quantile))
   regressor.fit(X_train, y_train)
   regression_models[quantile] = regressor

# Predicting quantiles
predictions_05 = regression_models[0.05].predict(X_test)
predictions_50 = regression_models[0.5].predict(X_test)
predictions_95 = regression_models[0.95].predict(X_test)

# Lower and upper bounds
lower_prediction = predictions_05
upper_prediction = predictions_95
median_prediction = predictions_50
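Because the three models are trained independently, nothing forces the predicted quantiles to stay in order, a problem known as quantile crossing. A quick diagnostic checks how often the interval is well-ordered −

# Fraction of test points where lower <= median <= upper
ordered = (lower_prediction <= median_prediction) & (median_prediction <= upper_prediction)
print(f"Well-ordered intervals: {ordered.mean():.0%}")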

Finally, we plot the median prediction and fill the gap between the lower and upper bounds, which displays the prediction interval around the median.

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(median_prediction, label='Median Prediction', color='green')  
plt.fill_between(range(len(median_prediction)), lower_prediction, upper_prediction, color='lightcoral', alpha=0.5, label='Prediction Interval') 
plt.title('Quantile Regression Prediction Interval')
plt.xlabel('Test Data Points')
plt.ylabel('Predictions')
plt.legend()
plt.show()

Output

Here is the outcome of the above model −

[Figure: median prediction (green line) with the shaded prediction interval between the 0.05 and 0.95 quantile predictions]