Building a Machine Learning Model for Customer Churn Prediction with Python and Scikit-Learn

Customer churn prediction is a critical business challenge that can significantly impact profitability and growth. This article demonstrates how to build a machine learning model using Python and scikit-learn to predict which customers are likely to leave your business. By analyzing historical customer data, we can identify at-risk customers and implement targeted retention strategies.

Prerequisites and Setup

Before starting, ensure scikit-learn, pandas, and NumPy are installed in your Python environment:

pip install scikit-learn pandas numpy

Building the Customer Churn Prediction Model

We'll create a complete example using synthetic customer data to demonstrate the entire machine learning pipeline from data preparation to model evaluation.

Step 1: Data Preparation

First, let's create a synthetic dataset and handle the initial preprocessing:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create synthetic customer data
np.random.seed(42)
n_samples = 1000

data = {
    'Age': np.random.randint(18, 80, n_samples),
    'MonthlyCharges': np.random.uniform(20, 120, n_samples),
    'TotalCharges': np.random.uniform(50, 8000, n_samples),
    'Tenure': np.random.randint(1, 72, n_samples),
    'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'PaymentMethod': np.random.choice(['Electronic check', 'Credit card', 'Bank transfer'], n_samples),
    'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples)
}

# Create target variable (churn) with some logic
churn_probability = (
    0.3 * (data['Age'] < 30) +
    0.4 * (data['MonthlyCharges'] > 80) +
    0.5 * (data['Tenure'] < 12) +
    0.3 * (np.array(data['Contract']) == 'Month-to-month')
)

data['Churn'] = (np.random.random(n_samples) < churn_probability).astype(int)

df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
Dataset shape: (1000, 8)

First 5 rows:
   Age  MonthlyCharges  TotalCharges  Tenure      Contract    PaymentMethod InternetService  Churn
0   49       91.297753   6533.613949      10  Month-to-month  Electronic check         DSL      1
1   78       23.148042   5834.494008      47      One year     Credit card  Fiber optic      0
2   65       71.552424   5756.079303      39      Two year  Electronic check         DSL      0
3   82       71.195031   3221.533881      41  Month-to-month  Bank transfer          No      0
4   74       79.910297   6953.064529      61      One year  Electronic check         DSL      0

Step 2: Feature Engineering and Preprocessing

Now we'll encode the categorical variables (numerical features are scaled later, in Step 3):

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Encode categorical variables
label_encoders = {}
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService']

for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

print("Encoded features:")
print(X.head())
print("\nChurn distribution:")
print(y.value_counts())
Encoded features:
   Age  MonthlyCharges  TotalCharges  Tenure  Contract  PaymentMethod  InternetService
0   49       91.297753   6533.613949      10         0              2                0
1   78       23.148042   5834.494008      47         1              1                1
2   65       71.552424   5756.079303      39         2              2                0
3   82       71.195031   3221.533881      41         0              0                2
4   74       79.910297   6953.064529      61         1              2                0

Churn distribution:
0    672
1    328
Name: Churn, dtype: int64
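A note on the encoding choice: LabelEncoder maps each category to an integer in alphabetical order, which imposes an artificial ordering (e.g. 'Month-to-month' < 'One year' < 'Two year') that linear models can misinterpret. A common alternative for nominal features is one-hot encoding. Here is a minimal, self-contained sketch using a small stand-in frame with the same categorical columns:

```python
import pandas as pd

# A tiny stand-in frame with the same nominal columns used above
sample = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'PaymentMethod': ['Electronic check', 'Credit card', 'Bank transfer'],
})

# One-hot encode: each category becomes its own 0/1 column,
# so no artificial ordering is imposed on the model
encoded = pd.get_dummies(sample, columns=['Contract', 'PaymentMethod'])
print(sorted(encoded.columns))
```

For tree-based models like Random Forest, integer labels are usually harmless; the difference matters most for distance-based and linear models such as logistic regression.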

Step 3: Model Training and Evaluation

Let's split the data and train multiple models to compare performance:

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Train Random Forest (tree-based models don't require feature scaling)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test_scaled)
rf_pred = rf_model.predict(X_test)

# Calculate accuracy
log_reg_accuracy = accuracy_score(y_test, log_reg_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"Logistic Regression Accuracy: {log_reg_accuracy:.3f}")
print(f"Random Forest Accuracy: {rf_accuracy:.3f}")

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, log_reg_pred))
Logistic Regression Accuracy: 0.760
Random Forest Accuracy: 0.775

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       134
           1       0.66      0.62      0.64        66

    accuracy                           0.76       200
   macro avg       0.74      0.73      0.73       200
weighted avg       0.76      0.76      0.76       200
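The import list above also includes confusion_matrix, which breaks the predictions down into true/false positives and negatives. As a standalone illustration with toy labels (not the actual y_test from this run):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test and a model's predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are actual classes, columns are predicted classes:
#   cm[0][0] = true negatives,  cm[0][1] = false positives
#   cm[1][0] = false negatives, cm[1][1] = true positives
print(cm)  # [[3 1]
           #  [1 3]]
```

For churn specifically, false negatives (churners predicted as staying) are often the costliest cell, so the confusion matrix can be more actionable than accuracy alone.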

Step 4: Feature Importance Analysis

Finally, let's examine which features contribute most to the predictions, and use the trained model to score a new customer:

# Feature importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(feature_importance)

# Making predictions on new data
new_customer = pd.DataFrame({
    'Age': [25],
    'MonthlyCharges': [95.0],
    'TotalCharges': [200.0],
    'Tenure': [3],
    'Contract': [0],  # Month-to-month
    'PaymentMethod': [1],  # Credit card
    'InternetService': [1]  # Fiber optic
})

new_customer_scaled = scaler.transform(new_customer)
churn_probability = log_reg.predict_proba(new_customer_scaled)[0][1]
prediction = log_reg.predict(new_customer_scaled)[0]

print("\nNew Customer Prediction:")
print(f"Churn Probability: {churn_probability:.3f}")
print(f"Predicted Churn: {'Yes' if prediction == 1 else 'No'}")
Feature Importance:
           feature  importance
2     TotalCharges    0.232581
1   MonthlyCharges    0.217439
3           Tenure    0.199846
0              Age    0.158246
4         Contract    0.093412
5    PaymentMethod    0.053064
6  InternetService    0.045411

New Customer Prediction:
Churn Probability: 0.736
Predicted Churn: Yes

Model Performance Comparison

Model                 Accuracy   Best For             Advantages
Logistic Regression   76.0%      Interpretability     Fast, provides probabilities
Random Forest         77.5%      Feature importance   Handles non-linear relationships
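A single train/test split gives a somewhat noisy accuracy estimate; cross-validation averages performance over several splits and makes the comparison more robust. A minimal sketch, using make_classification as a stand-in for the churn features above (note the pipeline, which refits the scaler inside each fold to avoid leakage):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features above
X, y = make_classification(n_samples=1000, n_features=7, random_state=42)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(random_state=42)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```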

Key Implementation Steps

The customer churn prediction model involves several critical steps:

  • Data Preprocessing: Handle missing values, encode categorical variables, and scale features

  • Feature Selection: Identify the most relevant customer attributes for prediction

  • Model Training: Train multiple algorithms and compare their performance

  • Evaluation: Use metrics like accuracy, precision, recall, and F1-score

  • Deployment: Implement the model for real-time predictions
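For the deployment step, a common first move is persisting the fitted model so a separate serving process can load it. A minimal sketch using joblib (which ships alongside scikit-learn), with a small stand-in model and a hypothetical filename:

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small stand-in model (in practice, the churn model trained above)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 7))
y = rng.integers(0, 2, size=100)
model = LogisticRegression().fit(X, y)

# Persist the fitted model to disk
path = Path("churn_model.joblib")  # hypothetical filename
joblib.dump(model, path)

# A serving process would load it back and verify predictions match
loaded = joblib.load(path)
assert np.array_equal(model.predict(X), loaded.predict(X))
path.unlink()  # clean up the example file
```

In a real deployment the scaler and label encoders must be saved and reloaded alongside the model (or bundled into a single Pipeline), since incoming customers need the exact same preprocessing the training data received.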

Conclusion

Building a machine learning model for customer churn prediction enables businesses to proactively identify at-risk customers and implement retention strategies. The Random Forest model achieved 77.5% accuracy, with TotalCharges and MonthlyCharges being the most important predictive features. This approach helps businesses reduce customer attrition and improve long-term profitability through data-driven decision making.

Updated on: 2026-03-27T14:16:36+05:30
