Linear Regression in Python using Statsmodels


Linear regression is a key algorithm in statistics and machine learning, so every data scientist should understand its fundamentals. Several Python libraries make it easy to implement, and Statsmodels is one of the most powerful. This article walks through linear regression with Statsmodels, using examples drawn from real data to aid comprehension.

Understanding Linear Regression

Linear regression is a statistical technique that models the relationship between two variables by fitting a linear equation to the observed data. One variable is the dependent (response) variable whose change is being examined; the other is the explanatory (independent) variable.
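Concretely, fitting the line y = intercept + slope·x by ordinary least squares has a simple closed form: the slope is the covariance of x and y divided by the variance of x. A minimal sketch with made-up toy data (the numbers below are purely illustrative):

```python
import numpy as np

# Toy data roughly following y = 2 + 3x (synthetic, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Ordinary least squares closed form: slope = cov(x, y) / var(x)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

print(intercept, slope)  # close to 2 and 3
```

Libraries like Statsmodels compute exactly this fit (generalized to many predictors) and add the surrounding statistics.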

Overview of Statsmodels

Statsmodels is a Python package built specifically for statistics. It sits on top of other strong libraries such as NumPy, SciPy, and Matplotlib, and it offers a full range of statistical tests along with robust estimation for many statistical models.

Implementing Linear Regression using Statsmodels

Before you begin, make sure Statsmodels and the other required libraries are installed:

pip install statsmodels pandas numpy matplotlib

Example 1: Simple Linear Regression

Let's begin with a straightforward example of linear regression with just one independent variable. We'll use the classic mtcars dataset, which Statsmodels can fetch from the R datasets collection. It records fuel consumption (mpg) together with ten aspects of automobile design and performance for 32 cars.

First, load the data and import the relevant libraries:

import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd

# Fetch the mtcars dataset from the R datasets collection
# (get_rdataset downloads it, so an internet connection is required)
mtcars = sm.datasets.get_rdataset("mtcars").data

Let's now fit a simple linear regression model that predicts mpg from wt (vehicle weight):

# Define dependent and independent variables
X = mtcars["wt"]
y = mtcars["mpg"]

# Add a constant (intercept) term to the independent variable
X = sm.add_constant(X)

# Perform linear regression
model = sm.OLS(y, X)
results = model.fit()

# Print out the statistics
print(results.summary())

Here, sm.OLS fits the linear regression model, and results.summary() prints a comprehensive report on the model fit.

Example 2: Multiple Linear Regression

Let's now move on to a slightly more challenging scenario with multiple independent variables. This time we'll predict mpg from both weight (wt) and horsepower (hp).

# Define dependent and independent variables
X = mtcars[["wt", "hp"]]
y = mtcars["mpg"]

# Add a constant (intercept) term to the independent variables
X = sm.add_constant(X)

# Perform linear regression
model = sm.OLS(y, X)
results = model.fit()

# Print out the statistics
print(results.summary())

X now contains two predictor columns, one for weight and one for horsepower, in addition to the constant.

Example 3: Plotting the Results

Finally, let's visualize the simple regression model from the first example. We'll plot the original data (mpg vs. wt) and overlay the fitted regression line.

# Refit the simple model from Example 1, since `results` currently
# holds the multiple regression fit from Example 2
X = sm.add_constant(mtcars["wt"])
results = sm.OLS(mtcars["mpg"], X).fit()

# Plot the original data
plt.scatter(mtcars["wt"], mtcars["mpg"])

# Plot the regression line
plt.plot(mtcars["wt"], results.fittedvalues, 'r')

# Set the labels and show the plot
plt.xlabel('wt')
plt.ylabel('mpg')
plt.title('Linear Regression Plot of mpg vs wt')
plt.show()

This code uses Matplotlib to draw a scatter plot of the original data and overlays the fitted regression line (in red) using the model's fitted values. The plot shows the relationship between mpg and weight visually.

Interpreting the Results

The regression summary includes several statistical measures. The coefficient on weight (wt) tells us how much mpg changes with each additional unit of weight, holding other variables fixed. R-squared measures the proportion of the variation in mpg that the model explains; the closer R-squared is to 1, the better the model fits the data.

The p-value tests the null hypothesis that the coefficient is equal to zero (no effect). A p-value below a chosen threshold (commonly 0.05) lets you reject the null hypothesis.

Conclusion

Linear regression is a powerful statistical method for making predictions from relationships between data variables. Python's Statsmodels package provides the functionality to build linear regression models in just a few lines of code.

The three examples in this article show how to build simple and multiple linear regression models with Statsmodels and how to visualize the regression line on a scatter plot. Working through them is a practical way to learn linear regression and how to apply it in Python with Statsmodels.

Keep in mind that real-world data frequently involves many variables and may require more complex models. Treat this as a first step toward more sophisticated data analysis in Python.

Updated on: 18-Jul-2023

