Predicting customer next purchase using machine learning

Retaining customers is essential to succeeding in a competitive market, and retaining existing customers is more cost-effective than acquiring new ones. Customer retention builds a loyal customer base, increases revenue, and supports long-term profitability. However, factors such as economic conditions, competition, and fashion trends make customer behavior and preferences difficult to forecast. To address these challenges, businesses need machine learning and data analytics capabilities that can analyze customer data and produce accurate predictions. By anticipating customers' next purchases, businesses can adjust marketing efforts, improve the customer experience, and increase satisfaction, which in turn improves retention and loyalty. In this article, we'll apply machine learning to predict customers' next purchases.

Predicting Customer Next Purchase Using Machine Learning

Here is a step-by-step guide to using machine learning to forecast a customer's next purchase:

  • Import the necessary libraries, then load the data, clean it, and engineer features

  • Create training and test sets from the data

  • Train a Random Forest Regressor model on the training data

  • Use a variety of metrics to assess the model's performance, including the explained variance score, R-squared, mean absolute error, and mean squared error

Data Collection and Preparation

We acquire the data in this stage and carry out any necessary feature engineering and cleaning. The Online Retail dataset is publicly available from the UCI Machine Learning Repository and could be used for this project. To keep the example self-contained, the code below generates sample data that simulates the structure of that dataset.

# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score

# Step 1: Create sample data (simulating the Online Retail dataset)
np.random.seed(42)
customer_ids = np.random.randint(10000, 20000, 1000)
dates = pd.date_range('2023-01-01', periods=1000, freq='D')
quantities = np.random.randint(1, 10, 1000)
prices = np.round(np.random.uniform(5.0, 50.0, 1000), 2)

df = pd.DataFrame({
    'Customer ID': customer_ids,
    'InvoiceDate': np.random.choice(dates, 1000),
    'Quantity': quantities,
    'Price': prices
})

# Clean the data
df = df[df['Customer ID'].notna()]  # Remove rows without CustomerID
df = df[df['Quantity'] > 0]  # Remove rows with negative or zero quantity
df = df[df['Price'] > 0]  # Remove rows with negative or zero price

# Parse dates
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create features
df['TotalPrice'] = df['Quantity'] * df['Price']
df['InvoiceYearMonth'] = df['InvoiceDate'].dt.strftime('%Y%m').astype(int)
df['LastPurchaseDate'] = df.groupby('Customer ID')['InvoiceDate'].transform('max')
df['DaysSinceLastPurchase'] = (df['InvoiceDate'].max() - df['LastPurchaseDate']).dt.days

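# Create target variable (days until next purchase).
# The sample data contains no future purchases, so the next purchase date is
# simulated as 7 to 29 days after each customer's last recorded purchase.
df['NextPurchaseDate'] = df.groupby('Customer ID')['InvoiceDate'].transform(
    lambda x: x.max() + timedelta(days=int(np.random.randint(7, 30)))
)
df['DaysUntilNextPurchase'] = (df['NextPurchaseDate'] - df['InvoiceDate']).dt.days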

# Select relevant columns
df = df[['Customer ID', 'TotalPrice', 'InvoiceYearMonth', 'DaysSinceLastPurchase', 'DaysUntilNextPurchase']]
df = df.drop_duplicates()

print("Data shape after cleaning:", df.shape)
print(df.head())
Data shape after cleaning: (874, 5)
   Customer ID  TotalPrice  InvoiceYearMonth  DaysSinceLastPurchase  DaysUntilNextPurchase
0        15666       33.10             20230101                      0                     18
1        11396       42.45             20230101                      0                     20
2        19323      251.13             20230101                      0                     28
3        11642      140.25             20230101                      0                     13
4        14882        8.05             20230101                      0                     26
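
The target above is simulated with random offsets. With real transaction history, the days until the next purchase can instead be derived from each customer's own sequence of invoice dates. The following is a minimal sketch of that idea, assuming a transaction table with the same `Customer ID` and `InvoiceDate` columns; the `tx` and `labeled` names and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical mini transaction log; column names mirror the article's DataFrame.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2023-01-01", "2023-01-11", "2023-02-10",
        "2023-01-05", "2023-01-25",
        "2023-03-01",
    ]),
})

# Sort so that each customer's purchases are in chronological order.
tx = tx.sort_values(["Customer ID", "InvoiceDate"]).reset_index(drop=True)

# For every row, look up the same customer's following purchase date.
tx["NextPurchaseDate"] = tx.groupby("Customer ID")["InvoiceDate"].shift(-1)

# Gap in days; NaN for each customer's most recent purchase (no observed follow-up).
tx["DaysUntilNextPurchase"] = (tx["NextPurchaseDate"] - tx["InvoiceDate"]).dt.days

# Keep only rows whose next purchase was actually observed.
labeled = tx.dropna(subset=["DaysUntilNextPurchase"])
print(labeled[["Customer ID", "InvoiceDate", "DaysUntilNextPurchase"]])
```

Rows without an observed follow-up purchase are dropped here for simplicity; whether that is appropriate depends on how the predictions will be used.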

Split the Data into Training and Testing Sets

We divide the data into training and testing sets to evaluate our model's performance:

# Step 2: Split the Data into Training and Testing Sets
X = df.drop(['DaysUntilNextPurchase'], axis=1)
y = df['DaysUntilNextPurchase']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
Training set size: 699
Testing set size: 175
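
Because a customer can contribute several rows, a purely random split may place the same customer in both the training and the testing set. One possible alternative, sketched below, is to split by customer with scikit-learn's `GroupShuffleSplit`. The `demo` table and its values are hypothetical stand-ins for the article's feature table; this is an optional variation, not the split used in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in for the article's feature table: 200 rows, up to 40 customers.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Customer ID": rng.integers(100, 140, size=200),
    "TotalPrice": rng.uniform(5.0, 450.0, size=200).round(2),
    "DaysUntilNextPurchase": rng.integers(7, 30, size=200),
})

X_demo = demo.drop(columns=["DaysUntilNextPurchase"])
y_demo = demo["DaysUntilNextPurchase"]

# Hold out roughly 20% of the customers (groups), not 20% of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=demo["Customer ID"]))

X_tr, X_te = X_demo.iloc[train_idx], X_demo.iloc[test_idx]
y_tr, y_te = y_demo.iloc[train_idx], y_demo.iloc[test_idx]

# No customer should appear on both sides of the split.
shared = set(X_tr["Customer ID"]) & set(X_te["Customer ID"])
print("Customers in both sets:", len(shared))
```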

Train the Machine Learning Model

We use a Random Forest Regressor to predict the days until next purchase based on customer behavior patterns:

# Step 3: Train the Machine Learning Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"Number of features: {model.n_features_in_}")
Model trained successfully!
Number of features: 4

Model Evaluation

We evaluate the model's performance using multiple regression metrics:

# Step 4: Evaluate the Model
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)

print('Mean Absolute Error:', round(mae, 4))
print('Mean Squared Error:', round(mse, 4))
print('R-squared:', round(r2, 4))
print('Explained Variance Score:', round(evs, 4))

# Feature importance
feature_names = X.columns
importance_scores = model.feature_importances_

print("\nFeature Importance:")
for name, score in zip(feature_names, importance_scores):
    print(f"{name}: {score:.4f}")
Mean Absolute Error: 5.8286
Mean Squared Error: 48.4857
R-squared: 0.0332
Explained Variance Score: 0.0416

Feature Importance:
Customer ID: 0.1689
TotalPrice: 0.3254
InvoiceYearMonth: 0.2891
DaysSinceLastPurchase: 0.2166
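
Error values such as the MAE are easier to interpret next to a trivial baseline. The sketch below compares a Random Forest with scikit-learn's `DummyRegressor`, which always predicts the mean of the training target. The data here is hypothetical and generated so that the target is independent of the features, loosely mirroring the simulated target in this tutorial; the variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: the target is drawn independently of the features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(500, 3))
y_demo = rng.integers(7, 30, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

for name, est in [("Mean baseline", baseline), ("Random forest", forest)]:
    pred = est.predict(X_te)
    print(f"{name}: MAE={mean_absolute_error(y_te, pred):.3f}, "
          f"R2={r2_score(y_te, pred):.3f}")

# R-squared of the constant baseline on the held-out data (never above zero).
baseline_r2 = r2_score(y_te, baseline.predict(X_te))
```

If a model does not clearly beat this baseline on held-out data, its features carry little usable signal about the target.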

Performance Metrics Explanation

Metric | Description | Best Value
Mean Absolute Error (MAE) | Average absolute difference between predicted and actual values | 0 (lower is better)
Mean Squared Error (MSE) | Average squared difference between predicted and actual values | 0 (lower is better)
R-squared (R²) | Proportion of the variance in the target that the model explains | 1.0 (higher is better)
Explained Variance Score | Proportion of variance explained, ignoring any constant offset in the prediction errors | 1.0 (higher is better)
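
To make these definitions concrete, the following sketch computes each metric by hand with NumPy and checks the result against the corresponding scikit-learn function. The `y_true` and `y_hat` arrays are hypothetical values chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical actual and predicted values (days until next purchase).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 41.0])

errors = y_true - y_hat  # [-2, 2, -3, -1]

mae = np.mean(np.abs(errors))  # (2 + 2 + 3 + 1) / 4 = 2.0
mse = np.mean(errors ** 2)     # (4 + 4 + 9 + 1) / 4 = 4.5

# R-squared: 1 - (sum of squared errors) / (total sum of squares) = 1 - 18/500
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(errors) / Var(y_true) = 1 - 3.5/125
evs = 1 - np.var(errors) / np.var(y_true)

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
assert np.isclose(evs, explained_variance_score(y_true, y_hat))

print(f"MAE={mae}, MSE={mse}, R2={r2:.3f}, EVS={evs:.3f}")
```

The explained variance score is slightly higher than R-squared here because the errors have a nonzero mean (-1), which R-squared penalizes and the explained variance score does not.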

Making Predictions

Here's how to use the trained model to predict next purchase timing for new customers:

# Example prediction for a new customer
new_customer_data = pd.DataFrame({
    'Customer ID': [99999],
    'TotalPrice': [150.50],
    'InvoiceYearMonth': [202312],
    'DaysSinceLastPurchase': [5]
})

predicted_days = model.predict(new_customer_data)
print(f"Predicted days until next purchase: {predicted_days[0]:.2f}")
print(f"Customer likely to purchase again in {int(round(predicted_days[0]))} days")
Predicted days until next purchase: 18.73
Customer likely to purchase again in 19 days

Conclusion

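Machine learning can help predict customer purchase behavior from historical transaction data. The Random Forest model uses features such as order value, purchase month, and recency to forecast when a customer might make their next purchase. Because the target in this example is simulated at random, the R-squared of 0.0332 is close to zero; on real transaction data, where the time to the next purchase is actually observed, the same pipeline can be trained and evaluated in the same way. Reliable predictions enable businesses to optimize marketing campaigns, improve customer retention strategies, and personalize customer experiences.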

Updated on: 2026-03-27T10:40:33+05:30
