House Price Prediction using Machine Learning in Python
House price prediction using machine learning has revolutionized the real estate industry by leveraging Python's powerful data analysis capabilities. This comprehensive guide explores how to build predictive models that help buyers, sellers, and investors make informed decisions in the dynamic housing market.
Linear Regression for House Price Prediction
Linear regression is a widely used technique for house price prediction due to its simplicity and interpretability. It assumes a linear relationship between independent variables (bedrooms, bathrooms, square footage) and the dependent variable (house price).
By fitting a linear regression model to historical data, we estimate coefficients that represent the relationship between features and target variable. This enables predictions on new data by multiplying feature values with their respective coefficients. Linear regression provides insights into each feature's impact on house prices, helping understand the significance of different factors.
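Concretely, a fitted linear regression predicts price as an intercept plus a weighted sum of feature values. A minimal sketch with illustrative, made-up coefficients (not values learned from real data):

```python
import numpy as np

# Hypothetical coefficients for [bedrooms, bathrooms, sqft_living]
coef = np.array([20000.0, 15000.0, 100.0])
intercept = 50000.0

# One house: 3 bedrooms, 2 bathrooms, 1500 sqft of living space
x = np.array([3, 2, 1500])

# Prediction = intercept + sum of (coefficient * feature value)
predicted_price = intercept + np.dot(coef, x)
print(predicted_price)  # 50000 + 3*20000 + 2*15000 + 1500*100 = 290000.0
```

Each coefficient directly answers "how much does the predicted price change per unit of this feature, holding the others fixed," which is what makes the model easy to interpret.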
Dataset Overview
We'll use the KC House Data dataset from Kaggle, which contains house sale prices for King County, Washington, including Seattle. The dataset includes features like:
bedrooms – Number of bedrooms
bathrooms – Number of bathrooms
sqft_living – Square footage of living space
sqft_lot – Square footage of the lot
floors – Number of floors
zipcode – ZIP code location
Implementation Steps
Follow these steps to build a house price prediction model:
Step 1: Import Libraries and Load Data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load the dataset
data = pd.read_csv('kc_house_data.csv')
print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
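Before selecting features, it is also worth checking for missing values, since linear regression cannot handle them directly. A self-contained sketch of the idea (using a small toy frame in place of the Kaggle file):

```python
import pandas as pd

# Toy frame standing in for the loaded dataset
data = pd.DataFrame({
    'bedrooms': [3, 2, None, 4],
    'price': [300000, 250000, 410000, None],
})

# Count missing values per column
print(data.isnull().sum())

# One simple strategy: drop rows with any missing value
clean = data.dropna()
print("Rows before:", len(data), "after:", len(clean))
```

More careful alternatives, such as imputing missing values with a column median, preserve more training data than dropping rows outright.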
Step 2: Feature Selection and Data Preparation
# Select features and target variable
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
target = 'price'
X = data[features]
y = data[target]
print("Features shape:", X.shape)
print("Target shape:", y.shape)
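Note that zipcode is really a categorical label, not a quantity, so feeding it to the model as a raw number (as the simple example below does for brevity) is a known simplification. A hedged sketch of the usual fix, one-hot encoding with pandas' `get_dummies` (shown on a toy frame):

```python
import pandas as pd

# Toy frame: zipcode is a category, not a magnitude
X = pd.DataFrame({
    'sqft_living': [1500, 2000, 1200],
    'zipcode': [98001, 98004, 98001],
})

# One-hot encode zipcode so the model learns one weight per area
X_encoded = pd.get_dummies(X, columns=['zipcode'], prefix='zip')
print(X_encoded.columns.tolist())  # ['sqft_living', 'zip_98001', 'zip_98004']
```

With one indicator column per ZIP code, the model can learn a separate price adjustment for each area instead of a single meaningless slope over ZIP code digits.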
Step 3: Split Data and Train Model
# Create sample data for demonstration
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate sample house data
np.random.seed(42)
n_samples = 1000
bedrooms = np.random.randint(1, 6, n_samples)
bathrooms = np.random.uniform(1, 4, n_samples)
sqft_living = np.random.randint(500, 5000, n_samples)
sqft_lot = np.random.randint(1000, 10000, n_samples)
floors = np.random.randint(1, 4, n_samples)
zipcode = np.random.choice([98001, 98002, 98003, 98004, 98005], n_samples)

# Create price based on features (with some noise)
price = (bedrooms * 20000 + bathrooms * 15000 + sqft_living * 100 +
         sqft_lot * 5 + floors * 10000 + np.random.normal(0, 20000, n_samples))

# Create DataFrame
data = pd.DataFrame({
    'bedrooms': bedrooms,
    'bathrooms': bathrooms,
    'sqft_living': sqft_living,
    'sqft_lot': sqft_lot,
    'floors': floors,
    'zipcode': zipcode,
    'price': price
})

# Select features and target
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
X = data[features]
y = data['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model ("r2" avoids shadowing sklearn's r2_score function)
r2 = model.score(X_test, y_test)
print("Model R² Score:", round(r2, 4))

# Predict price for a new house
new_house = pd.DataFrame({
    'bedrooms': [3],
    'bathrooms': [2.5],
    'sqft_living': [2000],
    'sqft_lot': [5000],
    'floors': [2],
    'zipcode': [98004]
})
predicted_price = model.predict(new_house)
print("Predicted Price: $", round(predicted_price[0], 2))
Model R² Score: 0.9476
Predicted Price: $ 346532.44
Model Performance Analysis
from sklearn.metrics import mean_squared_error
import numpy as np

# Calculate additional metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Model Performance Metrics:")
print("R² Score:", round(r2, 4))
print("Mean Squared Error:", round(mse, 2))
print("Root Mean Squared Error:", round(rmse, 2))

# Display feature coefficients
feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_
})
print("\nFeature Coefficients:")
print(feature_importance.round(2))
Model Performance Metrics:
R² Score: 0.9476
Mean Squared Error: 417688362.57
Root Mean Squared Error: 20437.88
Feature Coefficients:
Feature Coefficient
0 bedrooms 19661.49
1 bathrooms 15424.85
2 sqft_living 100.11
3 sqft_lot 4.99
4 floors 10066.29
5 zipcode -0.25
Key Insights
R² Score – Measures how well the model explains the variance in house prices
Feature Impact – Although its per-unit coefficient is small (≈100), sqft_living varies over thousands of square feet, giving living area the strongest overall influence on price; note also that the learned coefficients closely recover the weights used to generate the synthetic prices
Model Limitations – Linear regression assumes linear relationships, which may not capture complex market dynamics; the near-zero zipcode coefficient, for instance, reflects treating a categorical ZIP code as a plain number
Improving the Model
To enhance prediction accuracy, consider:
Feature Engineering – Create new features like price per square foot or age of the house
Data Preprocessing – Handle outliers and normalize features
Advanced Models – Try Random Forest, Gradient Boosting, or Neural Networks
Cross-validation – Use k-fold cross-validation for more robust evaluation
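The cross-validation idea above can be sketched with scikit-learn's cross_val_score, here on synthetic data mirroring the earlier example (feature names and weights are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic houses: living area and bedroom count drive price, plus noise
rng = np.random.default_rng(42)
n = 500
sqft = rng.integers(500, 5000, n)
bedrooms = rng.integers(1, 6, n)
X = np.column_stack([sqft, bedrooms])
y = sqft * 100 + bedrooms * 20000 + rng.normal(0, 20000, n)

# 5-fold cross-validated R²: each fold is held out once for testing,
# giving a more robust estimate than a single train/test split
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(scores.mean(), 3))
```

Reporting the mean and spread across folds guards against an optimistic score that happens to come from one lucky split.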
Conclusion
Machine learning provides powerful tools for house price prediction in Python. Linear regression offers a simple, interpretable starting point that reveals feature relationships. With proper data preprocessing and feature engineering, these models can provide valuable insights for real estate decision-making in competitive markets.
