
LightGBM - Regression
LightGBM (Light Gradient Boosting Machine) is a popular machine-learning method used for both regression and classification. When used for regression, it builds a series of decision trees, each one trying to reduce the errors left by the previous trees and thereby minimize a loss function such as mean squared error.
How LightGBM Works for Regression?
Gradient boosting, the foundation of LightGBM, builds decision trees sequentially, with each tree trying to correct the errors made by the previous ones.
Unlike other boosting algorithms, which grow trees level-wise, LightGBM builds trees leaf-wise. This means that when expanding a tree, it always splits the leaf that gives the largest reduction in loss (i.e., the leaf that improves the model the most). This produces deeper, more accurate trees, but it requires careful tuning to avoid overfitting.
To reduce the difference between predicted and actual values, LightGBM commonly uses loss functions such as mean squared error (MSE) and mean absolute error (MAE) for regression tasks.
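The loss that each new tree tries to reduce is selected through the objective parameter. Here is a minimal sketch using LightGBM's standard objective aliases, with all other settings left at their defaults:

from lightgbm import LGBMRegressor

# 'regression' (the default) optimizes MSE (L2 loss)
mse_model = LGBMRegressor(objective='regression')

# 'regression_l1' optimizes MAE (L1 loss), which is more robust to outliers
mae_model = LGBMRegressor(objective='regression_l1')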
When to Use LightGBM Regression
Here are some cases where LightGBM regression is a good choice −
When you have a large dataset.
When a quick and efficient model is needed.
When your data contains a large number of features (columns) or missing values.
Example of using LightGBM for Regression
Now let's look at how to build a LightGBM regression model. The following steps walk through the whole process.
Step 1 − Install Required Libraries
Before you start, make sure you have installed the necessary libraries: pandas for data manipulation, scikit-learn for splitting the data and evaluating the model, and lightgbm for the LightGBM model itself.
pip install pandas scikit-learn lightgbm
Step 2 − Load the Data
First, the dataset is loaded using pandas. This dataset contains health-related information including age, sex, BMI, number of children, region, smoking status, and medical charges.
import pandas as pd

# Load the dataset from your local file path
data = pd.read_csv('/My Docs/Python/medical_cost.csv')

# Display the first few rows of the dataset
print(data.head())
Output
This will produce the following result −
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
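Before separating the features, it can help to confirm the column types and look for missing values. This is an optional check on the data loaded above:

# Summary of column types and non-null counts
data.info()

# Count missing values per column
print(data.isnull().sum())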
Step 3 − Separate Features and Target Variable
Next, we separate the features (X) from the target variable (y). In this case, we want to predict the 'charges' column using the other columns.
# 'charges' is the target column that we want to predict
# All columns except 'charges' are features
X = data.drop('charges', axis=1)

# The 'charges' column is the target variable
y = data['charges']
Step 4 − Handle Categorical Data
The categorical features in the dataset (sex, smoker, and region) need to be converted into a numerical format because LightGBM works with numerical data. One-hot encoding converts these categorical columns into binary columns (0s and 1s).
# Convert categorical variables to numerical
X = pd.get_dummies(X, drop_first=True)
Here,
pd.get_dummies() generates an additional binary column for each category.
drop_first=True avoids multicollinearity by dropping the first category of each categorical variable.
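As an alternative to one-hot encoding, LightGBM can also consume pandas 'category' columns directly. The sketch below is not used in the rest of this example; it simply shows the native approach, assuming the same sex, smoker, and region columns:

# Alternative: keep categoricals as pandas 'category' dtype instead of one-hot encoding
X_native = data.drop('charges', axis=1)
for col in ['sex', 'smoker', 'region']:
   X_native[col] = X_native[col].astype('category')
# LGBMRegressor detects 'category' columns automatically when fitting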
Step 5 − Split the Data
To evaluate the model's performance, we will split the data into two sets − a training set with 80% of the data and a testing set with the remaining 20%.
train_test_split() splits the data randomly while keeping the given proportion (test_size=0.2).
Setting random_state=42 makes the split reproducible.
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 6 − Initialize the LightGBM Regressor
Now we will initialize the LightGBM model for regression. LGBMRegressor is the LightGBM implementation created specifically for regression tasks. It is efficient, flexible, and can handle large datasets effectively.
from lightgbm import LGBMRegressor

# Initialize the LightGBM regressor model
model = LGBMRegressor()
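The default constructor is a good starting point. If you need more control, LGBMRegressor also accepts the usual LightGBM hyperparameters; the values below are only illustrative and not tuned for this dataset:

# Illustrative (untuned) settings; num_leaves and max_depth limit the
# leaf-wise growth described earlier and help guard against overfitting
tuned_model = LGBMRegressor(
   n_estimators=200,     # number of boosting rounds (trees)
   learning_rate=0.05,   # contribution of each tree
   num_leaves=31,        # maximum number of leaves per tree
   max_depth=-1,         # -1 means no explicit depth limit
   random_state=42
)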
Step 7 − Train the Model
Next we will train the model on the training data (X_train and y_train). The fit() method trains the model by finding patterns in the training data that relate the features to the target variable (charges).
# Train the model on the training data
model.fit(X_train, y_train)
Output
After running the above code we will get the following outcome −
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001000 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1070, number of used features: 8
[LightGBM] [Info] Start training from score 13346.089733
LGBMRegressor()
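Optionally, training can monitor a held-out validation set and stop automatically once the validation score stops improving. Here is a minimal sketch using LightGBM's early_stopping callback; the extra split and parameter values are illustrative:

from lightgbm import early_stopping

# Carve a validation set out of the training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

model_es = LGBMRegressor(n_estimators=500)
model_es.fit(
   X_tr, y_tr,
   eval_set=[(X_val, y_val)],
   eval_metric='l2',
   callbacks=[early_stopping(stopping_rounds=50)]  # stop after 50 rounds without improvement
)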
Step 8 − Make Predictions
After training, we use the model to make predictions for the test set (X_test). The model.predict(X_test) generates predicted values for the test set based on patterns learned from training data.
# Predict on the test set
y_pred = model.predict(X_test)
Step 9 − Evaluate the Model
We will measure our model's performance with Mean Squared Error (MSE), a popular regression metric. MSE is the average of the squared differences between the predicted and actual values, so a lower MSE indicates better performance.
from sklearn.metrics import mean_squared_error

# Calculate the MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
Output
This will generate the below output −
Mean Squared Error: 20557383.0620152
Analyze the MSE value to see how well the model predicts the target variable. If the MSE is high, consider improving the model by tuning hyperparameters or gathering more data.
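Because MSE is expressed in squared units of the target, complementary metrics are often easier to interpret. The sketch below computes RMSE, MAE, and R² for the same predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# RMSE is in the same units as 'charges', so it is easier to read than MSE
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'RMSE: {rmse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'R^2: {r2:.3f}')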
Visualize the Mean Squared Error (MSE)
To visualize the Mean Squared Error, build a bar chart with the MSE value. This gives a clear visual representation of the error's magnitude.
Here is how to plot it using matplotlib, a popular Python library for plotting −
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Example data (replace these with your actual values)
# Actual values
y_test = [3, -0.5, 2, 7]
# Predicted values
y_pred = [2.5, 0.0, 2, 8]

# Calculate the MSE
mse = mean_squared_error(y_test, y_pred)

# Plotting the Mean Squared Error
plt.figure(figsize=(6, 4))
plt.bar(['Mean Squared Error'], [mse], color='blue')
plt.ylabel('Error Value')
plt.title('Mean Squared Error (MSE)')
plt.show()
Output
Here is the result of the above code − a bar chart with a single blue bar whose height equals the calculated MSE value.

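As a further optional check, a scatter plot of actual versus predicted charges shows where the model's predictions deviate. This sketch assumes the trained model and the feature matrix X and target y from Steps 3−5 are still in scope; the split is recreated with the same random_state so the toy y_test and y_pred used in the bar-chart example above do not interfere:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Recreate the test split (the same random_state reproduces the split from Step 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = model.predict(X_test)

# Points close to the dashed diagonal indicate accurate predictions
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, linestyle='--', color='red')
plt.xlabel('Actual charges')
plt.ylabel('Predicted charges')
plt.title('Actual vs Predicted Charges')
plt.show()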