Machine Learning - Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.
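The linear relationship described above can be written as y ≈ intercept + slope * x, and for a single predictor the least-squares values of both parameters have a closed form. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the function name fit_simple_linear_regression and the toy data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize the squared error of
    y ~ intercept + slope * x for one predictor (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data that lies exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)
```

Because the toy points lie exactly on a line, the recovered intercept and slope match the values used to generate them.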

Python Implementation

Given below is an example that shows how to implement simple linear regression in Python using the Diabetes dataset that ships with scikit-learn. We will also plot the regression line.

Data Preparation

First, we need to import the Diabetes dataset from scikit-learn and split it into training and testing sets. We will use 80% of the data for training the model and the remaining 20% for testing.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
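X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the 1-D feature arrays into 2-D column vectors of shape
# (n_samples, 1), which is the input shape scikit-learn estimators expect
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)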

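Here, we are using the third feature (column index 2) of the dataset, which represents the body mass index (BMI), as our independent variable (predictor variable) and the target, a quantitative measure of disease progression one year after baseline, as our dependent variable (response variable). Note that load_diabetes() returns feature values that are already mean-centered and scaled, so the BMI values are not in their original units.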
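To check which feature column index 2 selects and how many samples land in each split, the dataset's metadata can be inspected directly. The following is a minimal sketch; it assumes the object returned by load_diabetes() exposes a feature_names attribute, as it does in current scikit-learn releases.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()

# Name of the feature stored in column index 2
print(diabetes.feature_names[2])

# Same split as in the tutorial: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape to 2-D column vectors and report the resulting shapes
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
print(X_train.shape, X_test.shape)
```

The dataset has 442 samples, so a 20% test split leaves 353 rows for training and 89 for testing.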

Model Training

We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows −

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

Here, X_train represents the input feature (body mass index) of the training data and y_train represents the output variable (target variable).
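After fitting, a LinearRegression object stores the learned slope in coef_ and the learned intercept in intercept_. The following is a minimal sketch on toy data rather than the Diabetes data; the names toy_X, toy_y and toy_model are hypothetical. It shows that predict() computes intercept + slope * x.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that lies exactly on the line y = 1 + 2x
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression().fit(toy_X, toy_y)

slope = toy_model.coef_[0]        # one coefficient per input feature
intercept = toy_model.intercept_  # the constant term

# Reproduce the model's predictions by hand
manual_pred = intercept + slope * toy_X[:, 0]
print(np.allclose(manual_pred, toy_model.predict(toy_X)))
```

The same two attributes are available on lr_model once it has been fitted on the training data.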

Model Testing

Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows −

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).

Model Evaluation

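We need to evaluate the performance of the model to determine how closely its predictions match the actual values. We will use the mean squared error (MSE) and the coefficient of determination (R^2) as evaluation metrics. A lower MSE indicates smaller prediction errors, and an R^2 closer to 1 indicates that the model explains more of the variance in the target. The code for this is as follows −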

from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

Here, y_test represents the actual output variable of the test data.
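Both metrics follow directly from their definitions: MSE is the average squared residual, and R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The following is a minimal sketch in plain Python; the helper names mse_manual and r2_manual and the toy values are hypothetical and are not part of scikit-learn.

```python
def mse_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_manual(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values: two predictions are off by 1, two are exact
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

mse_value = mse_manual(actual, predicted)
r2_value = r2_manual(actual, predicted)
print(mse_value, r2_value)
```

On the toy values the residual sum of squares is 2 and the total sum of squares is 20, which gives an MSE of 0.5 and an R^2 of 0.9.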

Plotting the Regression Line

We can also visualize the regression line to see how well it fits the data. The code for this is as follows −

import matplotlib.pyplot as plt

# Plot the training data
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index (scaled)')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Here, we are using the scatter() function from the matplotlib library to plot the training data points and the plot() function to plot the regression line. The xlabel() and ylabel() functions are used to label the x-axis and y-axis of the plot, respectively. Finally, we use the show() function to display the plot.

Complete Implementation Example

The complete code for implementing simple linear regression in Python is as follows −

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the input data
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

# Plot the training data
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index (scaled)')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Output

On executing this code, a scatter plot of the training data with the fitted regression line is displayed, and the Mean Squared Error and the Coefficient of Determination are printed on the terminal −

Mean Squared Error: 4150.680189329983
Coefficient of Determination: 0.19057346847560164
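MSE is expressed in squared units of the target, so its square root (the root mean squared error, RMSE) is often easier to read. The following is a minimal sketch that takes the two printed values above as given; the variable names are hypothetical.

```python
import math

# Values copied from the printed output above
mse = 4150.680189329983
r2 = 0.19057346847560164

# RMSE is in the same units as the disease progression target
rmse = math.sqrt(mse)
print('Root Mean Squared Error:', round(rmse, 2))

# Share of the test-set variance explained by the single feature
print('Variance explained: {:.1%}'.format(r2))
```

An R^2 of about 0.19 means this single feature explains roughly 19% of the variance in the test targets, which is expected to be modest when only one of the ten available features is used.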