Building a Machine Learning Model for Customer Churn Prediction with Python and Scikit-Learn

Customer churn prediction is a critical business challenge that can significantly impact profitability and growth. This article demonstrates how to build a machine learning model using Python and scikit-learn to predict which customers are likely to leave your business. By analyzing historical customer data, we can identify at-risk customers and implement targeted retention strategies.

Prerequisites and Setup

Before starting, ensure scikit-learn, pandas, and NumPy are installed in your Python environment:

pip install scikit-learn pandas numpy

Building the Customer Churn Prediction Model

We'll create a complete example using synthetic customer data to demonstrate the entire machine learning pipeline from data preparation to model evaluation.

Step 1: Data Preparation

First, let's create a synthetic dataset and handle the initial preprocessing:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create synthetic customer data
np.random.seed(42)
n_samples = 1000

data = {
    'Age': np.random.randint(18, 80, n_samples),
    'MonthlyCharges': np.random.uniform(20, 120, n_samples),
    'TotalCharges': np.random.uniform(50, 8000, n_samples),
    'Tenure': np.random.randint(1, 72, n_samples),
    'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'PaymentMethod': np.random.choice(['Electronic check', 'Credit card', 'Bank transfer'], n_samples),
    'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples)
}

# Create target variable (churn) with some logic
churn_probability = (
    0.3 * (data['Age'] < 30) +
    0.4 * (data['MonthlyCharges'] > 80) +
    0.5 * (data['Tenure'] < 12) +
    0.3 * (np.array(data['Contract']) == 'Month-to-month')
)

data['Churn'] = (np.random.random(n_samples) < churn_probability).astype(int)

df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
Dataset shape: (1000, 8)

First 5 rows:
   Age  MonthlyCharges  TotalCharges  Tenure      Contract    PaymentMethod InternetService  Churn
0   49       91.297753   6533.613949      10  Month-to-month  Electronic check         DSL      1
1   78       23.148042   5834.494008      47      One year     Credit card  Fiber optic      0
2   65       71.552424   5756.079303      39      Two year  Electronic check         DSL      0
3   82       71.195031   3221.533881      41  Month-to-month  Bank transfer          No      0
4   74       79.910297   6953.064529      61      One year  Electronic check         DSL      0

Step 2: Feature Engineering and Preprocessing

Now we'll encode the categorical variables (numerical features are scaled later, in Step 3):

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Encode categorical variables
label_encoders = {}
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService']

for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

print("Encoded features:")
print(X.head())
print("\nChurn distribution:")
print(y.value_counts())
Encoded features:
   Age  MonthlyCharges  TotalCharges  Tenure  Contract  PaymentMethod  InternetService
0   49       91.297753   6533.613949      10         0              2                0
1   78       23.148042   5834.494008      47         1              1                1
2   65       71.552424   5756.079303      39         2              2                0
3   82       71.195031   3221.533881      41         0              0                2
4   74       79.910297   6953.064529      61         1              2                0

Churn distribution:
0    672
1    328
Name: Churn, dtype: int64
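A note on the encoding choice: LabelEncoder maps each category to an integer in alphabetical order, which imposes an artificial ordering (e.g. 'Month-to-month' < 'One year' < 'Two year') that linear models can misinterpret. A common alternative for nominal features is one-hot encoding. Here is a minimal, self-contained sketch using a small stand-in frame with the same categorical columns:

```python
import pandas as pd

# A tiny stand-in frame with the same nominal columns used above
sample = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'PaymentMethod': ['Electronic check', 'Credit card', 'Bank transfer'],
})

# One-hot encode: each category becomes its own 0/1 column,
# so no artificial ordering is imposed on the model
encoded = pd.get_dummies(sample, columns=['Contract', 'PaymentMethod'])
print(sorted(encoded.columns))
```

For tree-based models like Random Forest, integer labels are usually harmless; the difference matters most for distance-based and linear models such as logistic regression.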

Step 3: Model Training and Evaluation

Let's split the data and train multiple models to compare performance:

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Train Random Forest (tree-based models don't require feature scaling)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test_scaled)
rf_pred = rf_model.predict(X_test)

# Calculate accuracy
log_reg_accuracy = accuracy_score(y_test, log_reg_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"Logistic Regression Accuracy: {log_reg_accuracy:.3f}")
print(f"Random Forest Accuracy: {rf_accuracy:.3f}")

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, log_reg_pred))
Logistic Regression Accuracy: 0.760
Random Forest Accuracy: 0.775

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       134
           1       0.66      0.62      0.64        66

    accuracy                           0.76       200
   macro avg       0.74      0.73      0.73       200
weighted avg       0.76      0.76      0.76       200
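The import list above also includes confusion_matrix, which breaks the predictions down into true/false positives and negatives. As a standalone illustration with toy labels (not the actual y_test from this run):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test and a model's predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are actual classes, columns are predicted classes:
#   cm[0][0] = true negatives,  cm[0][1] = false positives
#   cm[1][0] = false negatives, cm[1][1] = true positives
print(cm)  # [[3 1]
           #  [1 3]]
```

For churn specifically, false negatives (churners predicted as staying) are often the costliest cell, so the confusion matrix can be more actionable than accuracy alone.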

Step 4: Feature Importance Analysis

Finally, let's examine which features contribute most to the predictions, and use the trained model to score a new customer:

# Feature importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(feature_importance)

# Making predictions on new data
new_customer = pd.DataFrame({
    'Age': [25],
    'MonthlyCharges': [95.0],
    'TotalCharges': [200.0],
    'Tenure': [3],
    'Contract': [0],  # Month-to-month
    'PaymentMethod': [1],  # Credit card
    'InternetService': [1]  # Fiber optic
})

new_customer_scaled = scaler.transform(new_customer)
churn_probability = log_reg.predict_proba(new_customer_scaled)[0][1]
prediction = log_reg.predict(new_customer_scaled)[0]

print("\nNew Customer Prediction:")
print(f"Churn Probability: {churn_probability:.3f}")
print(f"Predicted Churn: {'Yes' if prediction == 1 else 'No'}")
Feature Importance:
           feature  importance
2     TotalCharges    0.232581
1   MonthlyCharges    0.217439
3           Tenure    0.199846
0              Age    0.158246
4         Contract    0.093412
5    PaymentMethod    0.053064
6  InternetService    0.045411

New Customer Prediction:
Churn Probability: 0.736
Predicted Churn: Yes

Model Performance Comparison

Model                 Accuracy   Best For             Advantages
Logistic Regression   76.0%      Interpretability     Fast, provides probabilities
Random Forest         77.5%      Feature importance   Handles non-linear relationships
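A single train/test split gives a somewhat noisy accuracy estimate; cross-validation averages performance over several splits and makes the comparison more robust. A minimal sketch, using make_classification as a stand-in for the churn features above (note the pipeline, which refits the scaler inside each fold to avoid leakage):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features above
X, y = make_classification(n_samples=1000, n_features=7, random_state=42)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(random_state=42)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```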

Key Implementation Steps

The customer churn prediction model involves several critical steps:

  • Data Preprocessing: Handle missing values, encode categorical variables, and scale features

  • Feature Selection: Identify the most relevant customer attributes for prediction

  • Model Training: Train multiple algorithms and compare their performance

  • Evaluation: Use metrics like accuracy, precision, recall, and F1-score

  • Deployment: Implement the model for real-time predictions
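For the deployment step, a common first move is persisting the fitted model so a separate serving process can load it. A minimal sketch using joblib (which ships alongside scikit-learn), with a small stand-in model and a hypothetical filename:

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small stand-in model (in practice, the churn model trained above)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 7))
y = rng.integers(0, 2, size=100)
model = LogisticRegression().fit(X, y)

# Persist the fitted model to disk
path = Path("churn_model.joblib")  # hypothetical filename
joblib.dump(model, path)

# A serving process would load it back and verify predictions match
loaded = joblib.load(path)
assert np.array_equal(model.predict(X), loaded.predict(X))
path.unlink()  # clean up the example file
```

In a real deployment the scaler and label encoders must be saved and reloaded alongside the model (or bundled into a single Pipeline), since incoming customers need the exact same preprocessing the training data received.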

Conclusion

Building a machine learning model for customer churn prediction enables businesses to proactively identify at-risk customers and implement retention strategies. The Random Forest model achieved 77.5% accuracy, with TotalCharges and MonthlyCharges being the most important predictive features. This approach helps businesses reduce customer attrition and improve long-term profitability through data-driven decision making.

Updated on: 2026-03-27T14:16:36+05:30
