LightGBM - Core Parameters



Core parameters are the main settings you can adjust when training a LightGBM model. They control how the model learns from the data and therefore have a strong effect on its accuracy and training speed.

By tuning these parameters, you can adapt LightGBM to your specific data, task, and resource limits, and balance training speed, efficiency, and generalization ability.

Why Core Parameters Are Used

Core parameters in LightGBM help you to −

  • Control Model Complexity: Limit tree size and depth so the model is neither too simple (missing patterns) nor too complex (fitting noise).

  • Improve Accuracy: Adjust the learning process, such as the learning rate and the number of boosting rounds, so the model converges to a good solution.

  • Prevent Over-fitting: Use constraints or regularization penalties to keep the model from learning the noise in the data instead of the underlying patterns.

  • Speed Up Training: Choose how much of the data and how many features are used in each iteration.

  • Fit Different Tasks: Choose the objective and evaluation metric that match your problem, such as regression or classification, so that performance is monitored properly.

Core Parameters in LightGBM

Here we will focus on the core LightGBM parameters that control the model's behavior −

1. boosting_type (default = 'gbdt')

This parameter controls the boosting technique used in the training process. Options are as follows −

  • 'gbdt' (Gradient Boosting Decision Tree): The default method. It builds decision trees one after another, with each new tree correcting the errors of the trees before it.

  • 'dart' (Dropouts meet Multiple Additive Regression Trees): During training, some of the existing trees are randomly dropped in each iteration to reduce over-fitting.

  • 'goss' (Gradient-based One-Side Sampling): Data points with large gradients are kept and data points with small gradients are randomly sampled, which speeds up training.

  • 'rf' (Random Forest): Trees are built independently on random subsets of the data, and their predictions are averaged.
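
As a rough illustration of how these options are selected, the sketch below builds one parameter dictionary per boosting type. The values are illustrative assumptions, not tuned recommendations. drop_rate is LightGBM's dropout rate for DART, and the subsampling settings in the 'rf' entry reflect the fact that random forest mode needs bagging or feature subsampling to be enabled.

```python
# Shared settings reused by every variant (illustrative values only).
base_params = {"objective": "binary", "metric": "binary_logloss"}

boosting_variants = {
    "gbdt": {**base_params, "boosting_type": "gbdt"},
    # DART: drop_rate is the fraction of existing trees dropped per iteration.
    "dart": {**base_params, "boosting_type": "dart", "drop_rate": 0.1},
    # GOSS: newer LightGBM releases also expose this through a separate
    # data_sample_strategy setting; check the docs of your installed version.
    "goss": {**base_params, "boosting_type": "goss"},
    # Random forest mode needs subsampling of rows and/or features.
    "rf": {
        **base_params,
        "boosting_type": "rf",
        "bagging_fraction": 0.8,
        "bagging_freq": 1,
        "feature_fraction": 0.8,
    },
}

for name, variant in boosting_variants.items():
    print(name, variant)
```

Each dictionary can be passed to lgb.train() in place of the params used later in this chapter.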

2. objective

Specifies the loss function (objective function) that LightGBM will try to optimize. Common options are −

  • 'regression': Used to predict continuous values, such as house prices.

  • 'binary': Used for binary classification tasks (for example, yes/no or spam/ham).

  • 'multiclass': Used for classification problems with more than two classes. This objective also requires the num_class parameter.
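
The objective has to match the kind of target you are predicting. LightGBM's parameter value for problems with more than two classes is 'multiclass', and it must be accompanied by num_class. The helper below, suggest_objective, is a hypothetical name and a deliberately naive heuristic (it treats any non-integer target as regression); it only sketches the decision.

```python
def suggest_objective(labels):
    """Hypothetical helper: pick objective settings from the target values.

    Naive rule of thumb, for illustration only:
    non-integer targets -> regression, two distinct values -> binary,
    anything else -> multiclass (which also needs num_class).
    """
    unique = set(labels)
    has_fractional = any(
        isinstance(v, float) and not v.is_integer() for v in unique
    )
    if has_fractional:
        return {"objective": "regression"}
    if len(unique) == 2:
        return {"objective": "binary"}
    return {"objective": "multiclass", "num_class": len(unique)}


print(suggest_objective([0, 1, 1, 0]))
print(suggest_objective([0, 1, 2, 1]))
print(suggest_objective([1.5, 2.25, 3.0]))
```

Integer-valued regression targets (for example, counts) would be misclassified by this rule, so in practice the objective should be chosen from knowledge of the problem.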

3. metric

This parameter specifies the evaluation metric used to measure the model's performance. Common options are −

  • 'binary_logloss': The logarithmic loss for binary classification.

  • 'auc': Area under the ROC curve, commonly used for binary classification tasks.

  • 'rmse': The Root Mean Squared Error, used for regression tasks.
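
To make these metrics concrete, here is a minimal pure-Python sketch of how each one is computed from labels and predictions. These are the textbook definitions, written for readability; they are not LightGBM's internal implementations, and the sample values are made up.

```python
import math


def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Square root of the mean squared difference (lower is better)."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.4]
print("logloss:", binary_logloss(labels, probabilities))
print("auc:", auc(labels, probabilities))
print("rmse:", rmse([3.0, 5.0], [2.0, 7.0]))
```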

4. learning_rate (default = 0.1)

This core parameter controls the step size at each iteration while moving towards a minimum of the loss function.

  • Lower values (such as 0.01) make learning slower and usually require more boosting iterations, but often generalize better.

  • Higher values (such as 0.3) make learning faster, but the model risks overshooting the optimal solution.

5. num_iterations (default = 100)

It sets the number of boosting iterations, that is, the number of boosting rounds the model runs. More iterations let the model learn more, but they increase training time and can lead to over-fitting.
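
The interaction between learning_rate and num_iterations can be sketched with a toy model. The code below assumes a single data point, squared-error loss, and a "tree" that predicts the current residual exactly; only the shrinkage step (scaling each update by the learning rate) mirrors real gradient boosting. The function name rounds_to_converge is hypothetical.

```python
def rounds_to_converge(learning_rate, target=10.0, tolerance=0.01,
                       max_rounds=10_000):
    """Count boosting rounds until the prediction is within tolerance.

    Toy setup: each round adds learning_rate * residual to the prediction,
    so the remaining error shrinks by a factor of (1 - learning_rate).
    """
    prediction = 0.0
    for round_number in range(1, max_rounds + 1):
        residual = target - prediction
        prediction += learning_rate * residual  # shrinkage step
        if abs(target - prediction) <= tolerance:
            return round_number
    return max_rounds


for lr in (0.01, 0.05, 0.1, 0.3):
    print(f"learning_rate={lr}: {rounds_to_converge(lr)} rounds")
```

In this toy setting a smaller learning rate always needs more rounds to reach the same error, which is why the two parameters are usually tuned together.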

6. num_leaves (default = 31)

It sets the maximum number of leaves in each tree, which determines the tree's complexity.

  • Higher values provide more complex trees, but may lead to over-fitting.

  • Lower values simplify trees, reducing the possibility of over-fitting.

7. max_depth (default = -1)

It limits the maximum depth of each tree. A value of -1 means there is no limit.

  • Lower values (such as 3 or 5) produce shallower trees, which reduces model complexity.

  • Higher numbers allow for deeper trees, which can detect more complex patterns but may over-fit.
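
num_leaves and max_depth constrain each other: a binary tree of depth d can have at most 2^d leaves. The sketch below checks whether a given pair of values is consistent. Both function names are hypothetical helpers written for this illustration.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on leaves for a binary tree; None when depth is unlimited."""
    return None if max_depth <= 0 else 2 ** max_depth


def check_tree_params(num_leaves, max_depth):
    """Hypothetical sanity check: report which setting is the binding limit."""
    cap = max_leaves_for_depth(max_depth)
    if cap is None:
        return f"no depth limit; num_leaves={num_leaves} is the only cap"
    if num_leaves > cap:
        return (f"num_leaves={num_leaves} exceeds 2**{max_depth}={cap}; "
                f"max_depth is the binding limit")
    return f"num_leaves={num_leaves} fits within 2**{max_depth}={cap}"


print(check_tree_params(31, -1))
print(check_tree_params(31, 5))
print(check_tree_params(31, 3))
```

With max_depth set to 3, for example, a tree can never reach 31 leaves, so raising num_leaves further would have no effect.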

8. min_data_in_leaf (default = 20)

It is the minimum number of data points required in a leaf.

  • Higher numbers lower the possibility of over-fitting by making sure that each leaf has enough data points.

  • Lower values can improve model flexibility while increasing the danger of over-fitting.

9. feature_fraction (default = 1.0)

It controls the fraction of features used to train each tree.

  • A value of 1.0 uses all features.

  • Values less than 1.0 make LightGBM randomly select a subset of features for each tree, which helps prevent over-fitting and speeds up training.

10. bagging_fraction (default = 1.0)

It determines the fraction of data points used for training in each iteration.

  • A value of 1.0 uses all data points.

  • Lower values use a random subset, which adds randomness and helps prevent over-fitting. bagging_freq must also be set to a positive value for this to take effect.

11. bagging_freq (default = 0)

It determines how often bagging is performed. A value of 0 disables bagging. A positive value k enables bagging and draws a new random subset of the data every k iterations. bagging_fraction must be less than 1.0 for bagging to have any effect.
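
The way feature_fraction, bagging_fraction, and bagging_freq work together can be sketched with a small simulation. This is a simplified illustration using Python's random module, not LightGBM's internal sampler; simulate_sampling and its arguments are hypothetical names.

```python
import random


def simulate_sampling(n_rows, n_features, n_iterations,
                      bagging_fraction, bagging_freq, feature_fraction,
                      seed=42):
    """Return the (row indices, feature indices) used in each iteration."""
    rng = random.Random(seed)
    bagging_on = bagging_freq > 0 and bagging_fraction < 1.0
    rows = list(range(n_rows))  # all rows until bagging draws a subset
    schedule = []
    for iteration in range(n_iterations):
        # Rows are re-drawn only every bagging_freq iterations.
        if bagging_on and iteration % bagging_freq == 0:
            n_keep = int(n_rows * bagging_fraction)
            rows = sorted(rng.sample(range(n_rows), n_keep))
        # Features are re-drawn for every tree.
        n_cols = max(1, int(n_features * feature_fraction))
        cols = sorted(rng.sample(range(n_features), n_cols))
        schedule.append((rows, cols))
    return schedule


plan = simulate_sampling(n_rows=10, n_features=5, n_iterations=6,
                         bagging_fraction=0.8, bagging_freq=5,
                         feature_fraction=0.8)
for i, (rows, cols) in enumerate(plan):
    print(f"iteration {i}: rows={rows} features={cols}")
```

With bagging_freq set to 5, iterations 0 to 4 share one row subset and iteration 5 draws a new one, while the feature subset changes on every iteration.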

12. lambda_l1 & lambda_l2 (default = 0.0)

These parameters control L1 and L2 regularization, respectively. Higher values apply stronger regularization, which helps prevent over-fitting by penalizing large leaf values.
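
The effect of the two penalties can be seen in the standard regularized leaf-value formula used by gradient-boosted trees: L1 shrinks the gradient sum toward zero (soft thresholding) and L2 enlarges the denominator. The sketch below is a textbook version with made-up gradient and hessian sums; leaf_value is a hypothetical name, and real implementations apply further constraints.

```python
def leaf_value(sum_gradients, sum_hessians, lambda_l1=0.0, lambda_l2=0.0):
    """Regularized leaf output: -soft_threshold(G, l1) / (H + l2)."""
    magnitude = max(abs(sum_gradients) - lambda_l1, 0.0)  # L1 soft threshold
    sign = 1.0 if sum_gradients > 0 else -1.0
    return -sign * magnitude / (sum_hessians + lambda_l2)


G, H = -4.0, 2.0  # illustrative gradient and hessian sums for one leaf
print("no regularization:", leaf_value(G, H))
print("lambda_l1=1.0:    ", leaf_value(G, H, lambda_l1=1.0))
print("lambda_l2=2.0:    ", leaf_value(G, H, lambda_l2=2.0))
print("lambda_l1=5.0:    ", leaf_value(G, H, lambda_l1=5.0))
```

Both penalties pull the leaf value toward zero, and a large enough lambda_l1 sets it to exactly zero.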

13. min_gain_to_split (default = 0.0)

It is the minimum gain required to make a further split on a leaf node. Higher values make the model more conservative, which helps prevent over-fitting.
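
A minimal sketch of the split decision follows, assuming the textbook gain formula for gradient-boosted trees with L2 regularization only (constant factors and the L1 term are omitted for simplicity). The function names and the gradient and hessian sums are illustrative assumptions.

```python
def leaf_score(sum_gradients, sum_hessians, lambda_l2=0.0):
    """Quality score of a leaf: G^2 / (H + l2)."""
    return sum_gradients ** 2 / (sum_hessians + lambda_l2)


def should_split(left, right, min_gain_to_split=0.0, lambda_l2=0.0):
    """left and right are (sum_gradients, sum_hessians) for the two children.

    Returns (decision, gain), where gain is the children's combined score
    minus the score of the unsplit parent.
    """
    g_left, h_left = left
    g_right, h_right = right
    gain = (leaf_score(g_left, h_left, lambda_l2)
            + leaf_score(g_right, h_right, lambda_l2)
            - leaf_score(g_left + g_right, h_left + h_right, lambda_l2))
    return gain > min_gain_to_split, gain


candidate = ((-3.0, 2.0), (1.0, 2.0))
print(should_split(*candidate, min_gain_to_split=0.0))
print(should_split(*candidate, min_gain_to_split=5.0))
```

The same candidate split is accepted with a threshold of 0.0 and rejected with a threshold of 5.0, which is how a higher min_gain_to_split produces smaller trees.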

Implementing LightGBM using Core Parameters

Let's use LightGBM to build a model using these core parameters for the Breast Cancer dataset −

Installing LightGBM

First, run the following command to install LightGBM on your system −

pip install lightgbm

Importing Libraries and Loading Data

After installing the package, import the required libraries and load the data −

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Defining Core Parameters

Now let us define the core parameters for our model −

# Define LightGBM parameters
params = {
   'boosting_type': 'gbdt',
   'objective': 'binary',
   'metric': 'binary_logloss',
   'num_leaves': 31,
   'learning_rate': 0.05,
   'num_iterations': 100,
   'max_depth': 5,
   'feature_fraction': 0.8,
   'bagging_fraction': 0.8,
   'bagging_freq': 5,
   'lambda_l1': 0.1,
   'lambda_l2': 0.1,
   'min_gain_to_split': 0.01
}

Preparing Data for LightGBM

In this step, we wrap the training data in a LightGBM Dataset object, which is the format that lgb.train() expects.

# Prepare data for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)

Training the Model

Now train the model using the prepared dataset −

# Train the LightGBM model
model = lgb.train(params, train_data)

Making Predictions and Evaluating the Model

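Now use the trained model to make predictions and evaluate its accuracy. For the 'binary' objective, predict() returns probabilities, so we convert them to class labels using a threshold of 0.5 −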

# Make predictions
y_pred = model.predict(X_test)
y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred]

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred_binary)
print(f"Accuracy is as follows: {accuracy:.4f}")

This will produce output similar to the following −

Accuracy is as follows: 0.9737