Predicting customer next purchase using machine learning

Retaining customers is essential in a competitive market, and keeping existing customers is generally more cost-effective than acquiring new ones. Retention builds a loyal customer base, increases revenue, and supports long-term profitability. However, factors such as economic conditions, competition, and changing trends make customer behavior and preferences difficult to forecast. Machine learning and data analytics can help businesses analyze customer data and make useful predictions. By anticipating customers' next purchases, businesses can tailor marketing efforts, improve the customer experience, and increase satisfaction, which in turn supports retention and loyalty. In this article, we'll apply machine learning to predict when customers will make their next purchase.

Predicting customer next purchase using Machine Learning

Here is a step-by-step guide for using machine learning to forecast a customer's upcoming purchase −

Import the necessary libraries, load the data, clean it, and engineer features.
Split the data into training and test sets.
Train a random forest regressor on the training data.
Assess the model's performance with several metrics: mean absolute error, mean squared error, R-squared, and the explained variance score.

Algorithm

Import the necessary libraries, including datetime, numpy, and pandas.
Use pd.read_excel() to load the data into a DataFrame.
Remove any rows lacking a customer ID, since without one we cannot attribute a purchase to a customer. To do this, use df = df[df['Customer ID'].notna()].
Remove any rows with a zero or negative quantity, as they are probably errors. The expression used for this is df = df[df['Quantity'] > 0].
Remove any rows with a zero or negative price, since these are also probably errors. The expression used for this is df = df[df['Price'] > 0].
Use pd.to_datetime() to convert the InvoiceDate column to datetime values.
Create the following features −
TotalPrice − The product of the Quantity and Price columns, i.e. the total cost of each transaction line.
InvoiceYearMonth − The year and month of each transaction, derived from the InvoiceDate column.
LastPurchaseDate − The date of each customer's most recent purchase.
DaysSinceLastPurchase − The number of days between each customer's most recent purchase and the latest purchase date in the dataset.
NextPurchaseDate − A placeholder date chosen at random, 7 to 29 days after each customer's most recent purchase (np.random.randint(7, 30) excludes the upper bound). It is generated, not observed in the data.
DaysUntilNextPurchase − The number of days from each transaction's InvoiceDate to that customer's NextPurchaseDate. This is the target variable.
Select Customer ID, TotalPrice, InvoiceYearMonth, DaysSinceLastPurchase, and DaysUntilNextPurchase as the columns used for training the model.
Use df = df.drop_duplicates() to eliminate any duplicate rows.
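
The cleaning steps listed above can be tried on a small table before running them on the full dataset. The sketch below is only an illustration: the toy values and the helper name clean_transactions are made up for this example, while the column names follow the article's code.

```python
import pandas as pd

# Toy transactions (made-up values) using the column names from the article's code.
toy = pd.DataFrame({
    "Customer ID": [1.0, 1.0, None, 2.0, 2.0],
    "Quantity": [2, -1, 5, 3, 3],
    "Price": [10.0, 10.0, 4.0, 0.0, 2.5],
    "InvoiceDate": ["2010-01-05", "2010-01-06", "2010-01-07",
                    "2010-02-01", "2010-02-10"],
})

def clean_transactions(df):
    """Hypothetical helper: apply the three row filters, parse dates, add TotalPrice."""
    df = df[df["Customer ID"].notna()]   # drop rows with no customer ID
    df = df[df["Quantity"] > 0]          # drop zero or negative quantities
    df = df[df["Price"] > 0].copy()      # drop zero or negative prices
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df["TotalPrice"] = df["Quantity"] * df["Price"]
    return df

cleaned = clean_transactions(toy)
print(cleaned)
```

Of the five toy rows, only the first and the last pass all three filters, which makes it easy to check by eye that each condition does what the list describes.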

Gather and prepare the data

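In this stage we acquire the data and carry out the necessary cleaning and feature engineering. The dataset comes from the UCI Machine Learning Repository, which makes its online retail data publicly available. The code below loads the file online_retail_II.xlsx, whose column names (Customer ID, Price) match the Online Retail II version of the dataset.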

# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score

# Step 1: Gather and Prepare the Data
df = pd.read_excel('/content/sample_data/online_retail_II.xlsx')
df = df[df['Customer ID'].notna()] # Remove rows without CustomerID
df = df[df['Quantity'] > 0] # Remove rows with negative or zero quantity
df = df[df['Price'] > 0] # Remove rows with negative or zero price

# Parse dates
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create features
df['TotalPrice'] = df['Quantity'] * df['Price']
# Store year-month as an integer (e.g. 201012) so the model receives a numeric feature
df['InvoiceYearMonth'] = df['InvoiceDate'].dt.strftime('%Y%m').astype(int)
df['LastPurchaseDate'] = df.groupby('Customer ID')['InvoiceDate'].transform('max')
df['DaysSinceLastPurchase'] = (df['LastPurchaseDate'].max() - df['LastPurchaseDate']).dt.days
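# Placeholder target: a random date 7 to 29 days after each customer's last purchase
# (randint excludes 30); the generator is unseeded, so values differ between runs
df['NextPurchaseDate'] = df.groupby('Customer ID')['InvoiceDate'].transform(lambda x: x.max() + timedelta(days=np.random.randint(7, 30)))
# Days from each transaction to that customer's placeholder next purchase
df['DaysUntilNextPurchase'] = (df['NextPurchaseDate'] - df['InvoiceDate']).dt.days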
df = df[['Customer ID', 'TotalPrice', 'InvoiceYearMonth', 'DaysSinceLastPurchase', 'DaysUntilNextPurchase']]
df = df.drop_duplicates()
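
The code above generates NextPurchaseDate at random. As a point of comparison, and not as the article's method, the following sketch shows one way a gap to the next purchase could be derived from observed dates with groupby and shift. It assumes one row per purchase (the real dataset has one row per line item, so invoices would need to be aggregated first); the toy dates, the helper name add_observed_gap, and the column name ObservedDaysUntilNext are all hypothetical.

```python
import pandas as pd

# Made-up purchase dates for two hypothetical customers, one row per purchase.
purchases = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-01", "2010-01-11", "2010-02-10", "2010-03-01", "2010-03-08"]
    ),
})

def add_observed_gap(df):
    """Hypothetical helper: days from each purchase to the same customer's next one."""
    df = df.sort_values(["Customer ID", "InvoiceDate"]).copy()
    # shift(-1) within each customer pulls the following purchase date onto the row
    next_date = df.groupby("Customer ID")["InvoiceDate"].shift(-1)
    df["ObservedDaysUntilNext"] = (next_date - df["InvoiceDate"]).dt.days
    return df

gaps = add_observed_gap(purchases)
print(gaps)
```

Each customer's final purchase has no following purchase in the data, so its gap is missing and would have to be dropped or handled separately before training.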

Split the data into training and testing sets

In this phase we divide the data into training and testing sets.

Use X = df.drop(['DaysUntilNextPurchase'], axis=1) and y = df['DaysUntilNextPurchase'] to separate the independent variables (X) from the dependent variable (y).
Use train_test_split() to divide the data into training and testing sets. We specify a test size of 0.2 (20% of the rows) and a random state of 42 so the split is reproducible.

# Step 2 − Split the Data into Training and Testing Sets
X = df.drop(['DaysUntilNextPurchase'], axis=1)
y = df['DaysUntilNextPurchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
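
To see what test_size and random_state do, here is a minimal sketch on ten made-up samples. The array names are illustrative only and are not part of the article's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state matters mainly for reproducibility: without it, each run would evaluate the model on a different 20% of the rows.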

Train the machine learning model

We now train a random forest regressor on the training data.

Import the RandomForestRegressor class from sklearn.ensemble.
Create an instance of the class with n_estimators equal to 100 and random_state equal to 42.
Call model.fit(X_train, y_train) to fit the model to the training data.

# Step 3 − Train the Machine Learning Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
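
After fitting, it can be useful to check which features the forest relies on. The sketch below uses the feature_importances_ attribute of a fitted RandomForestRegressor on synthetic data; the data and variable names are assumptions made for illustration, not results from the retail dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Synthetic target that depends strongly on column 0 and not on the others.
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=100, random_state=42)
demo_model.fit(X_demo, y_demo)

# One importance value per feature; the values sum to 1.
importances = demo_model.feature_importances_
print(importances)
```

Applied to the article's model, the same attribute would show how much the predictions lean on each of Customer ID, TotalPrice, InvoiceYearMonth, and DaysSinceLastPurchase.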

Evaluate the model

At this point, we evaluate the model's performance using a range of metrics.

Call model.predict(X_test) to make predictions on the test data with the trained model.
Calculate the mean absolute error (MAE) using mean_absolute_error(y_test, y_pred).
Calculate the mean squared error (MSE) using mean_squared_error(y_test, y_pred).
Calculate R-squared (R2) using r2_score(y_test, y_pred).
Calculate the explained variance score (EVS) using explained_variance_score(y_test, y_pred).

# Step 4 − Evaluate the Model
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('R-squared:', r2)
print('Explained Variance Score:', evs)

Results

Mean Absolute Error − 2.8361809953950248
Mean Squared Error − 31.313248452439648
R-squared − 0.9975804147472181
Explained Variance Score − 0.9975804233638988

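The mean absolute error (MAE) is the average absolute difference between the predicted and actual values, here about 2.8 days. The mean squared error (MSE) is the average of the squared differences, so it penalizes large errors more heavily. The R2 statistic is the proportion of the target variable's variance that the model explains, with 1 being a perfect fit. The explained variance score (EVS) is similar to R2 but does not penalize a constant offset in the predictions, so the two coincide when the prediction errors average to zero. Keep in mind that the target here was generated from each customer's last purchase date plus a random offset, so these scores measure how well the model reproduces that constructed label, not how well it predicts observed purchase timing.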
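
The four metrics can also be computed by hand, which makes their definitions concrete. This is a small worked example on three made-up values using the standard formulas; it is not output from the model above.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])
residuals = y_true - y_hat  # [1, 0, -2]

mae = np.mean(np.abs(residuals))   # average absolute error
mse = np.mean(residuals ** 2)      # average squared error
# R2: 1 minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# EVS: 1 minus variance of the residuals over variance of the actual values
evs = 1 - np.var(residuals) / np.var(y_true)

print(mae, mse, r2, evs)
```

Here the residuals do not average to zero, so EVS (about 0.417) comes out higher than R2 (0.375); the gap reflects the constant offset that R2 penalizes and EVS ignores.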

Conclusion

In conclusion, the method used in this article involves acquiring and preparing customer data, splitting it into training and testing sets, training a machine learning model, and assessing its performance with several metrics. Predicting a customer's next purchase has potential uses in personalized marketing, improved customer experience, and customer retention.

Jay Singh

Updated on: 31-Jul-2023
