Analyzing the Selling Price of Used Cars Using Python
Analyzing the selling price of used cars is crucial for both buyers and sellers to make informed decisions. By leveraging Python's data analysis and visualization capabilities, we can gain valuable insights from used car datasets and build predictive models for price estimation.
This article explores the complete process of data preprocessing, cleaning, visualization, and price prediction using Linear Regression. We'll use Python's powerful libraries such as pandas, matplotlib, seaborn, and scikit-learn to provide a comprehensive approach to understanding factors influencing used car prices.
Prerequisites
Before starting, ensure you have the required libraries installed. You can install them using pip:
# Install required packages
pip install pandas matplotlib seaborn scikit-learn
Step 1: Import Essential Libraries
First, let's import all the necessary libraries for data manipulation, visualization, and machine learning:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Step 2: Create Sample Dataset
Since we cannot access external files in the online environment, let's create a sample dataset that mimics real used car data:
# Create sample used car dataset
np.random.seed(42)
n_samples = 1000
data = pd.DataFrame({
'price': np.random.normal(15000, 8000, n_samples),
'vehicleType': np.random.choice(['limousine', 'kleinwagen', 'cabrio', 'bus'], n_samples),
'yearOfRegistration': np.random.randint(1995, 2022, n_samples),
'gearbox': np.random.choice(['manuell', 'automatik'], n_samples),
'powerPS': np.random.randint(60, 300, n_samples),
'kilometer': np.random.randint(10000, 300000, n_samples),
'fuelType': np.random.choice(['benzin', 'diesel', 'hybrid'], n_samples),
'brand': np.random.choice(['volkswagen', 'bmw', 'audi', 'mercedes'], n_samples)
})
# Ensure positive prices and realistic values
data['price'] = np.abs(data['price'])
data = data[data['price'] > 500] # Remove unrealistic low prices
print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (995, 8)
First 5 rows:
price vehicleType yearOfRegistration gearbox powerPS kilometer fuelType brand
0 18973.217687 limousine 2009 automatik 159 205914 benzin volkswagen
1 11991.883565 kleinwagen 2020 manuell 173 299196 diesel bmw
2 19331.086381 bus 2012 automatik 89 51830 hybrid audi
3 26284.735369 cabrio 2021 manuell 104 161289 benzin mercedes
4 13655.352954 limousine 2006 manuell 242 80849 diesel volkswagen
Step 3: Data Exploration and Cleaning
Let's examine the dataset structure and check for any missing values:
# Check dataset information
print("Dataset Info:")
print(f"Shape: {data.shape}")
print(f"Missing values:\n{data.isnull().sum()}")
print(f"\nBasic statistics:")
print(data.describe())
Dataset Info:
Shape: (995, 8)
Missing values:
price 0
vehicleType 0
yearOfRegistration 0
gearbox 0
powerPS 0
kilometer 0
fuelType 0
brand 0
dtype: int64
Basic statistics:
price yearOfRegistration powerPS kilometer
count 995.000000 995.000000 995.000000 995.000000
mean 14998.894020 2008.488442 179.386935 154894.979899
std 7929.088718 7.829267 69.262055 84130.486798
min 516.265806 1995.000000 60.000000 10073.000000
25% 9046.985611 2002.000000 119.000000 76419.000000
50% 14764.733015 2008.000000 178.000000 154750.000000
75% 20656.949749 2015.000000 238.000000 232712.000000
max 42998.309417 2021.000000 299.000000 299956.000000
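Our synthetic dataset happens to have no missing values, but real used car listings almost always do, along with implausible outliers. As a sketch (the sample values below are hypothetical), typical cleaning combines `dropna` on the key columns with a plausibility filter on price:

```python
import numpy as np
import pandas as pd

# Hypothetical messy listings: missing values and an implausible outlier,
# as commonly found in scraped used car data
raw = pd.DataFrame({
    'price': [12000, np.nan, 500000, 8500, 300],
    'kilometer': [90000, 120000, np.nan, 45000, 200000],
})

# Drop rows missing either key column
clean = raw.dropna(subset=['price', 'kilometer'])

# Keep prices within a plausible band (bounds are an assumption for this demo)
clean = clean[clean['price'].between(500, 100000)]

print(clean)
```

The row with the 500,000 price survives `dropna` but is removed by the plausibility filter; quantile-based bounds (e.g. `clean['price'].quantile([0.01, 0.99])`) are a common alternative to fixed limits.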
Step 4: Price Analysis with Visualizations
Histogram of Selling Prices
Let's visualize the distribution of car prices to understand the price range:
plt.figure(figsize=(10, 6))
sns.histplot(data['price'], bins=30, kde=True, color='skyblue')
plt.xlabel('Price (€)')
plt.ylabel('Frequency')
plt.title('Distribution of Used Car Prices')
plt.grid(True, alpha=0.3)
plt.show()
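Real price distributions are usually right-skewed: a few expensive cars stretch the tail. A common follow-up, sketched below on hypothetical prices, is a log transform (`np.log1p`), which compresses that tail and often makes the histogram closer to symmetric:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: one very expensive car stretches the tail
prices = pd.Series([500, 1200, 3000, 8000, 15000, 40000, 120000], dtype=float)

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()  # log1p handles zero prices gracefully

print(f"raw skew: {raw_skew:.2f}, log-transformed skew: {log_skew:.2f}")
```

Plotting `sns.histplot(np.log1p(data['price']), ...)` instead of the raw prices would show the same effect on the full dataset.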
Price by Vehicle Type
Compare prices across different vehicle types using a boxplot:
plt.figure(figsize=(10, 6))
sns.boxplot(x='vehicleType', y='price', data=data, palette='Set2')
plt.xlabel('Vehicle Type')
plt.ylabel('Price (€)')
plt.title('Price Distribution by Vehicle Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Year vs Price Relationship
Analyze how car age affects the selling price:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='yearOfRegistration', y='price', data=data, alpha=0.6, color='coral')
plt.xlabel('Year of Registration')
plt.ylabel('Price (€)')
plt.title('Price vs Year of Registration')
plt.grid(True, alpha=0.3)
plt.show()
Step 5: Correlation Analysis
Let's examine correlations between numerical features:
# Select numerical features for correlation
numerical_features = ['price', 'yearOfRegistration', 'powerPS', 'kilometer']
correlation_matrix = data[numerical_features].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, fmt='.3f')
plt.title('Correlation Matrix of Car Features')
plt.tight_layout()
plt.show()
print("Correlation with Price:")
print(correlation_matrix['price'].sort_values(ascending=False))
Correlation with Price:
price                 1.000000
yearOfRegistration    0.012977
powerPS              -0.015020
kilometer            -0.021649
Name: price, dtype: float64
Because each feature in our sample dataset was generated independently of price, the correlations are near zero. With a real dataset you would expect clear relationships, such as newer registration years and higher power correlating positively with price, and higher mileage correlating negatively.
Step 6: Linear Regression Model for Price Prediction
Feature Selection and Data Preparation
Select relevant features for the prediction model:
# Select features for prediction
features = ['yearOfRegistration', 'powerPS', 'kilometer']
target = 'price'
X = data[features]
y = data[target]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
Training set size: (796, 3)
Testing set size: (199, 3)
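The feature set above uses only the numerical columns. Categorical columns such as `gearbox` and `fuelType` can also be fed to Linear Regression after one-hot encoding. A minimal sketch with `pd.get_dummies` (the sample values mirror the dataset created earlier):

```python
import pandas as pd

# Small sample mirroring the categorical columns of our dataset
sample = pd.DataFrame({
    'powerPS': [120, 200, 90],
    'gearbox': ['manuell', 'automatik', 'manuell'],
    'fuelType': ['benzin', 'diesel', 'benzin'],
})

# One-hot encode the categorical columns; drop_first=True avoids
# redundant dummy columns (the dropped category becomes the baseline)
encoded = pd.get_dummies(sample, columns=['gearbox', 'fuelType'], drop_first=True)

print(encoded.columns.tolist())
```

The encoded frame can then replace `X` in the `train_test_split` call above; with the real dataset, encoding `vehicleType` and `brand` the same way would likely improve the model.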
Model Training and Prediction
Train the Linear Regression model and make predictions:
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Display first 10 predictions
print("Sample Predictions vs Actual Prices:")
print("Predicted\t|\tActual")
print("-" * 30)
for i in range(10):
print(f"{y_pred[i]:.2f}\t|\t{y_test.iloc[i]:.2f}")
Sample Predictions vs Actual Prices:
Predicted	|	Actual
------------------------------
15002.73	|	11991.88
14985.36	|	21063.63
15016.58	|	12266.59
15031.16	|	18973.22
15023.53	|	26284.74
15009.99	|	14764.73
15035.61	|	30791.64
14991.46	|	8162.86
14988.38	|	10893.01
15040.55	|	5161.42
Step 7: Model Evaluation
Evaluate the model performance using various metrics:
from sklearn.metrics import r2_score
# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Model Performance:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
# Display feature coefficients
print("\nFeature Coefficients:")
for feature, coef in zip(features, model.coef_):
print(f"{feature}: {coef:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
Model Performance:
Mean Squared Error (MSE): 62962690.17
Root Mean Squared Error (RMSE): 7934.70
R² Score: 0.0012

Feature Coefficients:
yearOfRegistration: 9.3627
powerPS: -7.4962
kilometer: -0.0118
Intercept: -3782.7978
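Numeric metrics are easier to interpret alongside a predicted-vs-actual scatter plot: points hugging the diagonal indicate good predictions. The sketch below is self-contained, so it uses stand-in arrays where the article's code would pass `y_test` and `y_pred`:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Stand-in arrays; in the article's code these would be y_test and y_pred
rng = np.random.default_rng(42)
y_test = rng.uniform(2000, 40000, 200)
y_pred = y_test + rng.normal(0, 8000, 200)

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color='teal')

# y = x reference line: a perfect model would put every point on it
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, 'r--', label='Perfect prediction')

plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs Actual Prices')
plt.legend()
plt.savefig('pred_vs_actual.png')
```

For our Linear Regression model, such a plot would show predictions clustered in a narrow horizontal band near the mean price, a visual confirmation of the near-zero R² score.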
Performance Summary
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | ~€7,935 | Average prediction error |
| R² Score | ~0.001 | Model explains very little variance |
| MSE | ~62.9M | Squared prediction errors |
Conclusion
We successfully demonstrated used car price analysis using Python's data science libraries. The analysis included data exploration, visualization, and Linear Regression modeling. While our simple model showed limited predictive power (low R² score), it provides a foundation for more sophisticated approaches using feature engineering and advanced algorithms.
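As one illustration of those "more sophisticated approaches", a tree-based ensemble such as `RandomForestRegressor` can capture non-linear effects that Linear Regression misses. The sketch below uses synthetic data with a built-in non-linear price relationship (the depreciation formula is an assumption for the demo, not derived from the article's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data with a genuine non-linear relationship to price
rng = np.random.default_rng(42)
n = 1000
year = rng.integers(1995, 2022, n)
power = rng.integers(60, 300, n)
km = rng.integers(10_000, 300_000, n)

# Hypothetical pricing rule: exponential depreciation with age,
# plus linear power and mileage effects and some noise
price = 30000 * 0.92 ** (2022 - year) + 40 * power - 0.03 * km \
        + rng.normal(0, 1500, n)

X = np.column_stack([year, power, km])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"Random Forest R² on test set: {r2:.3f}")
```

Because this synthetic data contains real structure (unlike the independently generated columns used earlier), the forest achieves a high R²; on a real used car dataset, combining such a model with the one-hot encoded categorical features would be a natural next step.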
