CatBoost - Handling Missing Values



Missing values occur when some data is not available in a dataset. This can happen for different reasons, such as mistakes during data collection or information that was deliberately left out. To build an accurate predictive model, we have to handle them carefully. In datasets, missing values are typically represented in two ways, explained below:

  • NaN (Not a Number): In numeric datasets, NaN is often used to represent missing or undefined values. NaN is a special floating-point value defined by the IEEE 754 standard and is widely used in programming languages like Python and libraries like NumPy.

  • NULL or NA: In database systems and statistical software, NULL or NA is used to identify missing values. These are simply placeholders that indicate a lack of data for a particular observation. A quick example of how pandas treats these placeholders follows this list.
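As a small illustration, the snippet below (a minimal sketch, separate from the main example of this chapter) shows how pandas flags both NaN and None as missing −

import numpy as np
import pandas as pd

# NaN and None are both recognized as missing by pandas
s = pd.Series([1.0, np.nan, 3.0, None])
print(s.isnull())        # True where a value is missing
print(s.isnull().sum())  # total number of missing entries -> 2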

How CatBoost Handles Missing Values

CatBoost can handle missing values on its own, so you do not have to impute them yourself. Here is how it works and how to use it −

1. Install CatBoost Library

First, make sure that the CatBoost library is installed. If it is not, you can install it using the command below −

pip install catboost

2. Import the Libraries

After installing, you can use CatBoost in your model. Import the necessary libraries, such as NumPy, Pandas, Matplotlib, Seaborn and scikit-learn, as shown below −

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor, Pool
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

3. Loading the Dataset

Now we load the dataset; here we use the House Prices dataset to build the model. We then divide it into training and testing sets and prepare the categorical features to pass to CatBoost during training.

# Load the dataset
data = pd.read_csv('/Python/Datasets/train.csv')
# Select features and target variable
features = data.columns.difference(['SalePrice']) # All columns except 'SalePrice'
target = 'SalePrice'

# Change categorical features to strings
categorical_features = data[features].select_dtypes(include=['object']).columns
for feature in categorical_features:
	data[feature] = data[feature].astype(str)

# Split data into features and target
X = data[features]
y = data[target]

# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Figure out categorical features 
categorical_features_indices = np.where(X.dtypes == 'object')[0]

This code loads and prepares the House Prices dataset for modeling. It converts the categorical columns to strings, which also turns any missing entries in those columns into the string 'nan'; this matters because CatBoost handles missing values natively only for numeric features. The data is then separated into features (X) and target (y) and split into training (80%) and testing (20%) sets. The variable categorical_features_indices tells CatBoost which features are categorical so that it can manage them properly during training.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) allows us to get a deeper understanding of the dataset.

Checking missing values

This step is central to this chapter and useful for any dataset: if missing values are not handled properly, they affect the model's predictions. Here we will see which columns of our dataset have missing values, along with the count for each.

# Check for missing values
missing_values = data.isnull().sum().sort_values(ascending=False)
missing_values = missing_values[missing_values > 0]
print("\nMissing Values Columns:\n", missing_values)

Output

Here is the output of the above code −

Missing Values Columns:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64

To check for missing values in the 'data' DataFrame, this code sums the null values in each column. It then prints only the columns with at least one missing value, together with their counts, sorted in descending order by the amount of missing data.
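With the data prepared and the missing values identified, we can train CatBoost directly on it without any imputation. Below is a minimal sketch using the objects created above; the hyperparameters are illustrative, and nan_mode='Min' simply spells out CatBoost's default treatment of numeric NaNs (each missing value is treated as smaller than all observed values) −

# Wrap the data in Pool objects so CatBoost knows the categorical columns
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)

# Train without imputing anything: numeric NaNs are handled natively
model = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=6,
                          nan_mode='Min', verbose=100)
model.fit(train_pool, eval_set=test_pool)

# Evaluate on the held-out set
y_pred = model.predict(test_pool)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 score:", r2_score(y_test, y_pred))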

Handling Imbalanced Classes with CatBoost

Many real-world applications, such as fraud detection, medical diagnosis and revenue loss prediction, depend on imbalanced datasets. In these cases, one class is considerably underrepresented in comparison to the other. This gap can lead to biased models that favor the dominant class and perform poorly on the minority class.

Methods for Handling Imbalanced Data in CatBoost

There are several ways to deal with imbalanced datasets when working with CatBoost. These include −

  • Automatic Class Weights

  • Balanced Accuracy Metric

  • Oversampling or Undersampling

  • Use the scale_pos_weight Parameter

  • Early Stopping

Let us look at a practical example of how to handle an imbalanced dataset with CatBoost and then test its performance. We will use a synthetic dataset to evaluate the effectiveness of the different methods.
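The snippets below assume a binary classification split (X_train, y_train, X_test, y_test) and the CatBoostClassifier import, neither of which has been defined so far. Here is a minimal sketch that sets them up; the 9:1 class ratio and dataset sizes are illustrative assumptions −

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

# Synthetic binary dataset with roughly a 9:1 class imbalance (illustrative)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)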

Automatic Class Weights

You can assign different weights to the classes in order to give the minority class more importance. This is done using CatBoost's class_weights parameter, which helps the model focus more on the minority class.

catboost_params = {
   'iterations': 500,
   'learning_rate': 0.05,
   'depth': 6,
   'loss_function': 'Logloss',
   # Higher weight for minority class   
   'class_weights': [1, 10]  
}
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)
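Instead of choosing the weights by hand, CatBoost can also derive them from the class frequencies in the training data via the auto_class_weights parameter; a minimal sketch −

catboost_params = {
   'iterations': 500,
   'learning_rate': 0.05,
   'depth': 6,
   'loss_function': 'Logloss',
   # Weights inversely proportional to class frequencies
   'auto_class_weights': 'Balanced'
}
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)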

Balanced Accuracy Metric

It is important to use evaluation metrics that account for uneven class distributions. The Balanced Accuracy metric, which considers performance on both classes, can be used to evaluate the model.

from sklearn.metrics import balanced_accuracy_score

y_pred = model.predict(X_test)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
print("Balanced Accuracy:", balanced_acc)

Oversampling or Undersampling

Before training the model, you can either oversample the minority class or undersample the majority class to balance the dataset. The Synthetic Minority Over-sampling Technique (SMOTE) is one way to create synthetic samples for the minority class.

from imblearn.over_sampling import SMOTE

sm = SMOTE()
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6)
model.fit(X_resampled, y_resampled)

Use the scale_pos_weight Parameter

This parameter is very useful when the dataset is heavily imbalanced. It increases the weight of the positive (minority) class in the loss function to compensate for the imbalance.

catboost_params = {
   'iterations': 500,
   'learning_rate': 0.05,
   'depth': 6,
   'loss_function': 'Logloss',
   # Increase for the minority class
   'scale_pos_weight': 10  
}
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)

Early Stopping

With highly imbalanced datasets, early stopping helps prevent the model from overfitting to the majority class.

# Hold out a validation set for early stopping (X_val and y_val were not defined above)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6)
model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=50)