Linear Regression using Python
Linear regression is one of the simplest and most widely used tools in machine learning: it models the relationship between two variables and tells you whether that relationship is positive or negative. It is also a good tool for quick predictive analysis. In this tutorial, we'll use the Python pandas package to load data and then estimate, interpret, and visualize linear regression models.
What is Regression?
Regression is a predictive modelling technique that estimates the relationship between a dependent variable and one or more independent variables.
Types of Regression
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Stepwise Regression
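To make the list above concrete, here is a minimal sketch of how these regression types might be set up in scikit-learn. The model choices are illustrative and not tied to the tutorial's dataset; note that scikit-learn has no built-in stepwise regression, so a feature-selection tool such as `SequentialFeatureSelector` plays a similar role.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

linear = LinearRegression()        # straight-line fit of y on X
logistic = LogisticRegression()    # models class probability via the logit link
# Polynomial regression = polynomial feature expansion + a linear fit
polynomial = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
print(type(polynomial).__name__)   # Pipeline
```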
Where is Linear Regression Used?
- Evaluating Trends and Sales Estimates
- Analysing the Impact of Price Changes
- Assessing Risk
Import Required Libraries and Dataset
First, we'll import the necessary libraries and create a sample dataset with pandas:
# Importing Necessary Libraries
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# For this example, we'll create sample data similar to building power consumption
np.random.seed(42)
dates = pd.date_range('2010-01-01', periods=1000, freq='D')
oat_temp = np.random.normal(50, 20, 1000) # Outdoor Air Temperature
power = 100 + 0.5 * oat_temp + np.random.normal(0, 10, 1000) # Power consumption
df = pd.DataFrame({
    'OAT (F)': oat_temp,
    'Power (kW)': power
}, index=dates)
print("Dataset shape:", df.shape)
print(df.head())
Dataset shape: (1000, 2)
OAT (F) Power (kW)
2010-01-01 59.967142 125.418649
2010-01-02 51.617357 117.439765
2010-01-03 64.464489 129.011624
2010-01-04 47.627840 109.103883
2010-01-05 45.227356 120.850361
Exploring the Dataset
Let's first visualize our dataset by plotting it with matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(df.index, df['OAT (F)'])
plt.title('Outdoor Air Temperature')
plt.ylabel('Temperature (F)')
plt.subplot(1, 2, 2)
plt.plot(df.index, df['Power (kW)'])
plt.title('Power Consumption')
plt.ylabel('Power (kW)')
plt.tight_layout()
plt.show()
# Check for missing values
print("Missing values:", df.isnull().values.any())
Missing values: False
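Our synthetic data has no gaps, but real datasets usually do. As a hypothetical aside, here are two common remedies pandas offers, shown on a made-up series rather than the tutorial's DataFrame:

```python
import pandas as pd
import numpy as np

# A small series with deliberate gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.dropna().tolist())       # drop rows with gaps -> [1.0, 3.0, 5.0]
print(s.interpolate().tolist())  # fill gaps linearly -> [1.0, 2.0, 3.0, 4.0, 5.0]
```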
Data Distribution Analysis
Let's examine the distribution of our variables using histograms:
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(df['OAT (F)'], bins=30, alpha=0.7)
plt.title('OAT Distribution')
plt.xlabel('Temperature (F)')
plt.subplot(1, 2, 2)
plt.hist(df['Power (kW)'], bins=30, alpha=0.7)
plt.title('Power Distribution')
plt.xlabel('Power (kW)')
plt.tight_layout()
plt.show()
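Beyond eyeballing the histograms, we can test normality numerically. A sketch using SciPy's D'Agostino-Pearson test on data generated the same way as our temperature column (a large p-value means we cannot reject normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(50, 20, 1000)  # same shape as the OAT column

# stats.normaltest combines skew and kurtosis into one statistic
stat, p = stats.normaltest(sample)
print(f"statistic={stat:.3f}, p-value={p:.3f}")
```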
Removing Outliers
Let's remove outliers that are more than 3 standard deviations from the mean:
std_dev = 3
df_clean = df[(np.abs(stats.zscore(df)) < std_dev).all(axis=1)]
print(f"Original size: {len(df)}, After removing outliers: {len(df_clean)}")
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(df.index, df['Power (kW)'], alpha=0.7, label='Original')
plt.title('Original Data')
plt.ylabel('Power (kW)')
plt.subplot(1, 2, 2)
plt.plot(df_clean.index, df_clean['Power (kW)'], alpha=0.7, label='Cleaned', color='orange')
plt.title('After Removing Outliers')
plt.ylabel('Power (kW)')
plt.tight_layout()
plt.show()
Original size: 1000, After removing outliers: 981
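The z-score rule above assumes roughly normal data. An alternative that is robust to skew is the 1.5 × IQR rule; a sketch on synthetic data (not the tutorial's DataFrame):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(100, 10, 1000))

# Keep values within 1.5 interquartile ranges of the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(f"Original size: {len(s)}, After IQR filtering: {int(mask.sum())}")
```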
Validate Linear Relationship
To find if there is any linear relation between the OAT and Power, let's plot a scatter plot:
plt.figure(figsize=(8, 6))
plt.scatter(df_clean['OAT (F)'], df_clean['Power (kW)'], alpha=0.6)
plt.xlabel('Outdoor Air Temperature (F)')
plt.ylabel('Power (kW)')
plt.title('Scatter Plot: OAT vs Power')
plt.grid(True, alpha=0.3)
plt.show()
# Calculate correlation
correlation = df_clean['OAT (F)'].corr(df_clean['Power (kW)'])
print(f"Correlation coefficient: {correlation:.3f}")
Correlation coefficient: 0.456
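`Series.corr` gives only the coefficient; `scipy.stats.pearsonr` also returns a p-value for testing whether the correlation is significant. A sketch on data generated the same way as the tutorial's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(50, 20, 1000)                     # temperature-like variable
y = 100 + 0.5 * x + rng.normal(0, 10, 1000)      # power-like variable

# Pearson r plus a two-sided significance test
r, p = stats.pearsonr(x, y)
print(f"r={r:.3f}, significant={p < 0.05}")
```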
Linear Regression Model
Now let's build and evaluate our linear regression model using k-fold cross-validation:
X = df_clean[['OAT (F)']]
y = df_clean['Power (kW)']
model = LinearRegression()
scores = []
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X, y):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    score = model.score(X.iloc[test_idx], y.iloc[test_idx])
    scores.append(score)
print("R² scores for each fold:", [f"{score:.4f}" for score in scores])
print(f"Average R² score: {np.mean(scores):.4f}")
# Fit the final model on all data
model.fit(X, y)
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
R² scores for each fold: ['0.2089', '0.2011', '0.2156']
Average R² score: 0.2085
Coefficient (slope): 0.2297
Intercept: 111.1749
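For simple regression with one feature, the fitted slope and intercept have a closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope·x̄. The sketch below cross-checks scikit-learn's fit against that formula on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.normal(50, 20, 200)
y = 100 + 0.5 * x + rng.normal(0, 10, 200)

# Closed-form simple OLS estimates
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Fit the same data with scikit-learn
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(np.isclose(slope, model.coef_[0]))        # True
print(np.isclose(intercept, model.intercept_))  # True
```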
Visualizing the Regression Line
Let's plot the regression line along with our data points:
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Data Points')
# Create prediction line
x_line = np.linspace(X['OAT (F)'].min(), X['OAT (F)'].max(), 100)
y_pred = model.predict(x_line.reshape(-1, 1))
plt.plot(x_line, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Outdoor Air Temperature (F)')
plt.ylabel('Power (kW)')
plt.title('Linear Regression: OAT vs Power')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Make a sample prediction
sample_temp = 60
predicted_power = model.predict(pd.DataFrame({'OAT (F)': [sample_temp]}))
print(f"Predicted power consumption at {sample_temp}°F: {predicted_power[0]:.2f} kW")
Predicted power consumption at 60°F: 124.96 kW
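Point predictions are more useful with an error estimate. A sketch of residual analysis and RMSE on data generated like the tutorial's (for OLS with an intercept, residuals average to zero, and RMSE should land near the noise standard deviation of 10):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = rng.normal(50, 20, 500).reshape(-1, 1)
y = 100 + 0.5 * x.ravel() + rng.normal(0, 10, 500)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
rmse = np.sqrt(mean_squared_error(y, model.predict(x)))

print(abs(residuals.mean()) < 1e-6)  # True: OLS residuals average to ~zero
print(f"RMSE: {rmse:.2f}")
```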
Model Performance Summary
| Metric | Value | Interpretation |
|---|---|---|
| R² Score | 0.2085 | Model explains ~21% of variance |
| Correlation | 0.456 | Moderate positive relationship |
| Coefficient | 0.2297 | 1°F increase → ~0.23 kW increase |
Conclusion
In this tutorial, we learned how to perform linear regression using Python. We explored the dataset, removed outliers, validated the linear relationship, and built a predictive model. The model shows a moderate positive relationship between outdoor air temperature and power consumption, though additional features could improve prediction accuracy.
