Loan Eligibility Prediction using Machine Learning Models in Python

Predicting loan eligibility is a crucial task in the banking and finance sector. Financial institutions, especially banks, use it to decide whether to approve a loan application. A number of variables are taken into consideration, including the applicant's income, credit history, loan amount, education, and employment status.

In this article, we will demonstrate how to predict loan eligibility using Python and its machine learning modules. We'll introduce some machine learning models, going over their fundamental ideas and demonstrating how they can be used to generate predictions.

Understanding the Problem

The objective here is to predict whether a loan application will be approved. This is a binary classification problem with two classes: Loan Approved and Loan Not Approved.

Data Preparation

Let's create a sample dataset and prepare it for machine learning. The dataset includes features like applicant's gender, marital status, education, number of dependents, income, loan amount, and credit history.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Create sample loan data
np.random.seed(42)
n_samples = 500

data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n_samples),
    'Married': np.random.choice(['Yes', 'No'], n_samples),
    'Education': np.random.choice(['Graduate', 'Not Graduate'], n_samples),
    'ApplicantIncome': np.random.randint(2000, 10000, n_samples),
    'LoanAmount': np.random.randint(50, 500, n_samples),
    'Credit_History': np.random.choice([0, 1], n_samples),
    'Loan_Status': np.random.choice(['Y', 'N'], n_samples)
})

print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (500, 7)

First 5 rows:
  Gender Married    Education  ApplicantIncome  LoanAmount  Credit_History Loan_Status
0   Male     Yes     Graduate             8623         169               1           N
1   Male      No  Not Graduate             3586         350               0           Y
2   Male     Yes     Graduate             6040         496               1           Y
3   Male     Yes     Graduate             3024         298               1           N
4   Male     Yes     Graduate             5649         137               1           Y

Data Preprocessing

We need to convert the categorical variables to numerical format and prepare the features and target variable:

# Encode categorical variables
le = LabelEncoder()

categorical_columns = ['Gender', 'Married', 'Education']
for col in categorical_columns:
    data[col] = le.fit_transform(data[col])

# Prepare features (X) and target (y)
X = data.drop('Loan_Status', axis=1)
y = le.fit_transform(data['Loan_Status'])

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nProcessed features:")
print(X.head())
Features shape: (500, 6)
Target shape: (500,)

Processed features:
   Gender  Married  Education  ApplicantIncome  LoanAmount  Credit_History
0       1        1          0             8623         169               1
1       1        0          1             3586         350               0
2       1        1          0             6040         496               1
3       1        1          0             3024         298               1
4       1        1          0             5649         137               1

Machine Learning Models Implementation

We will implement three different machine learning models and compare their performance.

Logistic Regression

Logistic Regression is a statistical method for binary classification problems. It uses the logistic function to model the probability of a particular class.
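To make this concrete, the logistic (sigmoid) function maps any real-valued score to a probability in (0, 1). A minimal sketch using NumPy:

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score into the (0, 1) probability range
    return 1.0 / (1.0 + np.exp(-z))

# Probabilities rise monotonically with the score
scores = np.array([-4.0, 0.0, 4.0])
print(sigmoid(scores))  # roughly [0.018, 0.5, 0.982]
```

A score of 0 maps to exactly 0.5, which is why 0.5 is the default decision threshold for binary classification.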

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Logistic Regression model
# max_iter is raised because unscaled income features can slow convergence
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Make predictions
lr_pred = lr_model.predict(X_test)

# Evaluate the model
lr_accuracy = accuracy_score(y_test, lr_pred)
print("Logistic Regression Accuracy:", lr_accuracy)
Logistic Regression Accuracy: 0.5
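Accuracy alone can hide how a model behaves on each class; scikit-learn's classification_report adds per-class precision, recall, and F1-score. A self-contained sketch on toy labels (the labels below are made up for illustration):

```python
from sklearn.metrics import classification_report

# Toy true vs. predicted labels (1 = approved, 0 = not approved)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

report = classification_report(y_true, y_pred)
print(report)
```

The same call works on the test-set predictions above, e.g. `classification_report(y_test, lr_pred)`.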

Decision Tree

A Decision Tree represents features as internal nodes, decision rules as branches, and outcomes as leaf nodes, resembling a flowchart structure.
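To illustrate the flowchart idea, a trained tree can be read as nested if/else rules. A hand-written toy rule with hypothetical thresholds (not learned from the data above):

```python
def toy_loan_rule(credit_history, applicant_income):
    # Hypothetical decision rules, for illustration only
    if credit_history == 1:
        if applicant_income >= 4000:
            return 'Y'  # good credit, sufficient income
        return 'N'      # good credit, low income
    return 'N'          # poor credit history

print(toy_loan_rule(1, 5000))  # 'Y'
print(toy_loan_rule(0, 9000))  # 'N'
```

A real DecisionTreeClassifier learns such split features and thresholds from the training data automatically.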

from sklearn.tree import DecisionTreeClassifier

# Create and train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
dt_pred = dt_model.predict(X_test)

# Evaluate the model
dt_accuracy = accuracy_score(y_test, dt_pred)
print("Decision Tree Accuracy:", dt_accuracy)
Decision Tree Accuracy: 0.48

Random Forest

Random Forest builds multiple decision trees during training and outputs the majority vote (the mode) of the individual trees' predictions.
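The majority vote can be sketched directly. The per-tree predictions below are made up to show the mechanism:

```python
import numpy as np

# Hypothetical predictions from three trees for five applicants (1 = approve)
tree_votes = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
])

# Majority vote: a class wins when at least 2 of the 3 trees predict it
majority = (tree_votes.sum(axis=0) >= 2).astype(int)
print(majority)  # [1 1 1 1 0]
```

RandomForestClassifier performs this aggregation internally over its n_estimators trees.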

from sklearn.ensemble import RandomForestClassifier

# Create and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_pred = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)
Random Forest Accuracy: 0.52

Model Comparison

Let's compare the performance of all three models:

# Collect the accuracy scores of the three models
models = ['Logistic Regression', 'Decision Tree', 'Random Forest']
accuracies = [lr_accuracy, dt_accuracy, rf_accuracy]

# Display results
results_df = pd.DataFrame({
    'Model': models,
    'Accuracy': accuracies
})

print("Model Performance Comparison:")
print(results_df)
print(f"\nBest performing model: {models[np.argmax(accuracies)]} with {max(accuracies):.2f} accuracy")
Model Performance Comparison:
               Model  Accuracy
0  Logistic Regression      0.50
1       Decision Tree      0.48
2       Random Forest      0.52

Best performing model: Random Forest with 0.52 accuracy
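The comparison can also be visualized as a bar chart with matplotlib. A sketch using the accuracies reported above (hard-coded here so the snippet stands alone):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Accuracies from the runs above (your values will vary with the random data)
models = ['Logistic Regression', 'Decision Tree', 'Random Forest']
accuracies = [0.50, 0.48, 0.52]

plt.figure(figsize=(8, 4))
plt.bar(models, accuracies, color=['steelblue', 'darkorange', 'seagreen'])
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.title('Loan Eligibility Model Comparison')
plt.tight_layout()
plt.savefig('model_comparison.png')
```

In an interactive session you can call `plt.show()` instead of saving to a file.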

Feature Importance

Random Forest provides feature importance scores, which help us understand which features contribute most to loan approval decisions:

# Get feature importance from Random Forest
feature_names = X.columns
importance_scores = rf_model.feature_importances_

# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_scores
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(importance_df)
Feature Importance (Random Forest):
           Feature  Importance
3  ApplicantIncome    0.205073
4       LoanAmount    0.198045
5   Credit_History    0.186875
0           Gender    0.152893
2        Education    0.129490
1          Married    0.127623

Conclusion

We implemented three machine learning models for loan eligibility prediction using Python. Random Forest achieved the highest accuracy at 52%, with applicant income and loan amount emerging as the most influential features. Because the sample dataset was randomly generated, accuracies near 50% are expected here; on real loan data with genuine relationships between features and outcomes, proper data preprocessing, feature engineering, and model tuning would yield significantly better performance.
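As a starting point for the tuning mentioned above, scikit-learn's GridSearchCV can search hyperparameter combinations with cross-validation. A sketch on synthetic stand-in data (the features and grid values are illustrative, not tuned for any real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan features used in this article
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 4))
y = rng.integers(0, 2, size=200)

# Small illustrative grid; real searches would cover more values
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

The fitted `search` object can then be used like a regular classifier via `search.predict(...)`.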

Updated on: 2026-03-27T08:28:07+05:30
