
CatBoost - Handling Missing Values
Missing values mean that some data is not available in a dataset. This can happen for different reasons, such as mistakes during data collection or information that was intentionally left out. To build an accurate predictive model, we have to handle them carefully. In datasets, missing values are typically represented in two ways, explained below:
- NaN (Not a Number): In numeric datasets, NaN is often used to represent missing or undefined values. NaN is a special floating-point value defined by the IEEE 754 standard and is widely used in programming languages like Python and libraries like NumPy.
- NULL or NA: In database systems and statistical software, NULL or NA are used to mark missing values. These are placeholders that indicate a lack of data for a particular observation.
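As a quick illustration (with made-up values), here is how NaN appears in a pandas DataFrame and how it can be detected −
import numpy as np
import pandas as pd

# A tiny frame with one deliberately missing entry
df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0]})

# isnull() flags the NaN entry as True
print(df['LotFrontage'].isnull())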
How CatBoost Handles Missing Values
CatBoost can handle missing values on its own, so you do not have to impute them yourself. For numeric features, missing values are processed natively according to the nan_mode parameter ('Min' by default, which treats NaN as smaller than every other value). For categorical features, missing entries should be converted to strings first so that they become a regular category. Here is how it works and how to use it −
1. Install CatBoost Library
First, make sure that the CatBoost library is installed. If it is not, you can install it with the command below −
pip install catboost
2. Import the Libraries
After installing, you can use CatBoost in your model. Import the necessary libraries, such as NumPy, Pandas, Seaborn, Matplotlib, and scikit-learn, as shown below −
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor, Pool
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
3. Loading the Dataset
Now we load a dataset from the system directory; here we are using the house prices dataset to implement the model. We then divide it into training and testing sets and prepare the categorical features to pass to CatBoost during training.
# Load the dataset
data = pd.read_csv('/Python/Datasets/train.csv')

# Select features and target variable
features = data.columns.difference(['SalePrice'])  # All columns except 'SalePrice'
target = 'SalePrice'

# Change categorical features to strings
categorical_features = data[features].select_dtypes(include=['object']).columns
for feature in categorical_features:
    data[feature] = data[feature].astype(str)

# Split data into features and target
X = data[features]
y = data[target]

# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Figure out categorical features
categorical_features_indices = np.where(X.dtypes == 'object')[0]
This code loads and prepares the House Price dataset for modeling. It converts the categorical columns to strings, which also turns their NaN entries into the string 'nan', so missingness becomes an ordinary category. The data is then separated into features (X) and target (y), and split into training (80%) and testing (20%) sets. CatBoost uses the variable categorical_features_indices to find out which features are categorical, allowing it to properly manage them during training.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) allows us to get a deeper understanding of the dataset.
Checking missing values
This step is central to this chapter and useful for any dataset. If missing values are not handled properly, they affect the model's predictions. Here we will see which columns of our dataset contain missing values, along with the total count for each.
# Check for missing values
missing_values = data.isnull().sum().sort_values(ascending=False)
missing_values = missing_values[missing_values > 0]
print("\nMissing Values Columns:\n", missing_values)
Output
Here is the outcome of the above code −
Missing Values Columns:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64
To check for missing values in the data DataFrame, this code sums the null values in each column, keeps only the columns where the count is greater than zero, and prints them sorted in descending order by the amount of missing data.
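Training the Model with Missing Values
With the missing values identified, we can train CatBoost directly on this data without any imputation. The following is a minimal sketch continuing from the split created above: numeric NaNs are handled natively (nan_mode='Min' is the default and is shown only to make the behavior explicit), while the earlier string conversion turned categorical NaNs into a regular category −
# Wrap the data in Pool objects so categorical features are declared explicitly
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)

# nan_mode='Min' treats missing numeric values as smaller than all others
model = CatBoostRegressor(iterations=500, learning_rate=0.05,
                          depth=6, nan_mode='Min', verbose=100)
model.fit(train_pool)

# Evaluate on the held-out set
y_pred = model.predict(test_pool)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))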
Handling Imbalanced Classes with CatBoost
Many real-world applications depend on imbalanced datasets, such as fraud detection, medical diagnosis, and revenue loss prediction. In these cases, one class is heavily underrepresented compared to the other. This gap can lead to biased models that favor the dominant class and perform poorly on the minority class.
Methods for Handling Imbalanced Data in CatBoost
CatBoost has several built-in solutions for dealing with imbalanced datasets. These include −
- Automatic Class Weights
- Balanced Accuracy Metric
- Oversampling or Undersampling
- Use the scale_pos_weight Parameter
- Early Stopping
Let us look at a real-life example of how to handle an imbalanced dataset with CatBoost and then test its performance. We will use a synthetic dataset to evaluate the effectiveness of the different methods.
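The snippets that follow assume an imbalanced binary classification dataset that has already been split into training, validation, and test sets. Here is one possible setup, a sketch using scikit-learn's make_classification (the 90/10 class split and the sizes are arbitrary choices for illustration) −
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Hold out a test set, then carve a validation set out of the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)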
Automatic Class Weights
You can assign different weights to different classes in order to give the minority class more importance. This can be done with CatBoost's class_weights parameter, which helps the model focus more on the minority class.
catboost_params = {
    'iterations': 500,
    'learning_rate': 0.05,
    'depth': 6,
    'loss_function': 'Logloss',
    # Higher weight for minority class
    'class_weights': [1, 10]
}
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)
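Alternatively, instead of hand-picking the weights, CatBoost can compute them automatically from the class frequencies via its auto_class_weights parameter −
# 'Balanced' weights each class inversely to its frequency in the training data
model = CatBoostClassifier(iterations=500, learning_rate=0.05,
                           depth=6, auto_class_weights='Balanced')
model.fit(X_train, y_train)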
Balanced Accuracy Metric
Using evaluation metrics that account for uneven class distributions is important. The Balanced Accuracy metric, which gives equal weight to both classes, can be used to evaluate such models.
from sklearn.metrics import balanced_accuracy_score

y_pred = model.predict(X_test)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
print("Balanced Accuracy:", balanced_acc)
Oversampling or Undersampling
Before training the model, you can either oversample the minority class or undersample the majority class to balance the dataset. The Synthetic Minority Over-sampling Technique (SMOTE) is one method that can be used to create synthetic samples for the minority class.
from imblearn.over_sampling import SMOTE

sm = SMOTE()
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6)
model.fit(X_resampled, y_resampled)
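Note that SMOTE comes from the separate imbalanced-learn package (pip install imbalanced-learn). Resampling should be applied to the training data only, never to the test set, so that the evaluation still reflects the real class distribution.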
Use the scale_pos_weight Parameter
This parameter is very useful when the dataset is heavily imbalanced. It scales the loss contribution of the positive (minority) class to counteract the imbalance.
catboost_params = {
    'iterations': 500,
    'learning_rate': 0.05,
    'depth': 6,
    'loss_function': 'Logloss',
    # Increase for the minority class
    'scale_pos_weight': 10
}
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)
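Rather than fixing the value at 10, a common heuristic is to set scale_pos_weight to the ratio of negative to positive samples in the training data. Here is a sketch reusing the parameter dictionary above −
# Heuristic: ratio of negative (majority) to positive (minority) samples
ratio = float(np.sum(y_train == 0)) / float(np.sum(y_train == 1))

catboost_params['scale_pos_weight'] = ratio
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train)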
Early Stopping
In highly imbalanced datasets, early stopping is very helpful in preventing the model from overfitting to the majority class.
model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6)
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
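Here eval_set is the validation split created in the setup sketch above. Training stops once the validation loss has not improved for 50 consecutive iterations, and because CatBoost's use_best_model option is enabled by default when an eval set is supplied, the final model is taken from the best iteration rather than the last one.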