LightGBM - Binary Classification



What is Binary Classification?

Binary classification is a type of machine learning problem in which the goal is to sort data into one of two groups or classes. The model predicts one of two possible outcomes. For example, a spam filter can classify an email as 'spam' or 'not spam'.

The model is trained on labeled data in which every example belongs to one of the two classes. By identifying patterns in this data, the model learns to differentiate between the two groups, and it can then infer the class of new, unseen data.

Evaluation Metrics for Binary Classification

The following metrics are used when evaluating binary classifiers − each of them can be computed with scikit-learn, as shown in the sketch after this list −

  • Accuracy: The percentage of all predictions that are correct.

  • Precision: The fraction of positive predictions that are actually true positives.

  • Recall: Recall (also called sensitivity) is the proportion of actual positives that the model correctly predicts as positive.

  • F1-Score: The F1-Score is the harmonic mean of precision and recall.

  • Receiver Operating Characteristic - Area Under Curve: ROC-AUC measures how well the model can distinguish between the two classes.
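
Here is a minimal sketch that computes each of these metrics with scikit-learn. The y_true and y_score arrays below are hypothetical values invented purely for illustration −

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

#Hypothetical ground-truth labels and predicted probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_score = [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.5]
y_pred = [1 if p > 0.5 else 0 for p in y_score]  #Threshold probabilities at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))  #ROC-AUC uses probabilities, not labels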

Examples of Binary Classification

Here are some examples of binary classification tasks −

  • Email Filtering: Classifying an email as 'spam' or 'not spam'.

  • Disease Diagnosis: Determining whether a patient has a disease, i.e. whether the test result is positive or negative.

  • Sentiment Analysis: Classifying a customer review as 'positive' or 'negative'.

Implementation of Binary Classification

Here are the steps you need to follow to create a basic binary classifier using LightGBM −

Step 1: Import Libraries

Python libraries let us handle data and perform both basic and complex tasks with a few lines of code. Import the libraries below, which are needed for data manipulation, machine learning, and evaluation.

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Step 2: Create a Dummy Dataset

Create a DataFrame with 100 rows and four columns (feature1, feature2, feature3, and target). Here, feature1 and feature2 are continuous variables, feature3 is a categorical variable encoded as integers, and target is a binary target variable.

#Set seed for reproducibility
np.random.seed(42)

#Create a DataFrame with random data
data = pd.DataFrame({
    'feature1': np.random.rand(100),  #100 random numbers between 0 and 1
    'feature2': np.random.rand(100),  #100 random numbers between 0 and 1
    'feature3': np.random.randint(0, 10, size=100),  #100 random integers between 0 and 9
    'target': np.random.randint(0, 2, size=100)  #Binary target variable (0 or 1)
})

print(data.head())

The result of the above code is the first five rows of the dummy dataset.

Step 3: Split the Data

Separate the dataset into training and testing sets. Here, 30% of the data will be used for testing and 70% for training.

#Split the data into training and testing sets
X = data.drop('target', axis=1)  #Features
y = data['target']  #Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
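
If the target classes were imbalanced, passing stratify=y would keep the class proportions the same in both splits. This is an optional variation, not part of the original example −

#Optional: stratified split preserves the 0/1 ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)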

Step 4: Create LightGBM Datasets

Convert the training and testing data into LightGBM's own Dataset format. The train_data object is used for training, while test_data is used for evaluation during training.

#Create a LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

Step 5: Set LightGBM Parameters

Define the LightGBM model's objective function, metric, and other hyperparameters.

#Set LightGBM parameters
params = {
    'objective': 'binary',         #Binary classification task
    'metric': 'binary_error',      #Evaluation metric
    'boosting_type': 'gbdt',       #Gradient Boosting Decision Tree
    'num_leaves': 31,              #Number of leaves in one tree
    'learning_rate': 0.05,         #Step size for each iteration
    'feature_fraction': 0.9        #Fraction of features used for each iteration
}
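
The metric parameter also accepts a list, so several evaluation metrics can be tracked at once. This is a small optional variation on the parameters above −

#Optional: track both the error rate and ROC-AUC during training
params['metric'] = ['binary_error', 'auc']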

Step 6: Train the Model

Train the LightGBM model using the given parameters. Early stopping halts training if the validation metric does not improve for 10 consecutive rounds; in LightGBM 4.x it is supplied through the callbacks argument, as the early_stopping_rounds keyword has been removed.

#Train the model with early stopping (passed as a callback in LightGBM 4.x)
bst = lgb.train(params, train_data, valid_sets=[test_data],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])

Step 7: Predict and Evaluate

Make predictions on the test set, convert the predicted probabilities into binary labels, and then evaluate the model's accuracy.

#Predict and evaluate the model
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)  #Predict probabilities
y_pred_binary = [1 if x > 0.5 else 0 for x in y_pred]         #Convert probabilities to binary predictions
accuracy = accuracy_score(y_test, y_pred_binary)                #Calculate accuracy

print(f"Accuracy: {accuracy:.2f}")

This will produce the following result:

Accuracy: 0.50

The accuracy score shows the LightGBM model's performance on the test set. Because the target labels were generated at random, there is no real signal for the model to learn, so the accuracy is expected to be close to 0.5.
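
The same predict call can classify a brand-new observation. The feature values below are hypothetical, chosen only to demonstrate the call −

#Classify one hypothetical new observation
new_point = pd.DataFrame({'feature1': [0.5], 'feature2': [0.1], 'feature3': [3]})
prob = bst.predict(new_point, num_iteration=bst.best_iteration)[0]
print("Predicted class:", 1 if prob > 0.5 else 0)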

Summary

LightGBM is an effective tool for solving binary classification problems. It is particularly useful for large datasets with high-dimensional features, and its built-in handling of categorical features minimizes the preprocessing workload.
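
As a brief illustration of that built-in handling, the feature3 column from the dummy dataset could be declared categorical when building the Dataset. This is an optional sketch, not one of the steps above −

#Tell LightGBM to treat feature3 as a categorical feature
train_data = lgb.Dataset(X_train, label=y_train,
                         categorical_feature=['feature3'])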
