Loan Eligibility Prediction using Machine Learning Models in Python
Predicting loan eligibility is a crucial part of the banking and finance sector. It is used by financial institutions, especially banks, to determine whether to approve a loan application. A number of variables are taken into consideration, including the applicant's income, credit history, loan amount, education, and employment situation.
In this article, we will demonstrate how to predict loan eligibility using Python and its machine learning modules. We'll introduce some machine learning models, going over their fundamental ideas and demonstrating how they can be used to generate predictions.
Understanding the Problem
Predicting whether a loan will be accepted or not is the objective here. This is a binary classification problem with two classes: Loan Approved and Loan Not Approved.
Data Preparation
Let's create a sample dataset and prepare it for machine learning. The dataset includes features such as the applicant's gender, marital status, education, income, loan amount, and credit history, along with the loan status we want to predict.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Create sample loan data
np.random.seed(42)
n_samples = 500
data = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n_samples),
    'Married': np.random.choice(['Yes', 'No'], n_samples),
    'Education': np.random.choice(['Graduate', 'Not Graduate'], n_samples),
    'ApplicantIncome': np.random.randint(2000, 10000, n_samples),
    'LoanAmount': np.random.randint(50, 500, n_samples),
    'Credit_History': np.random.choice([0, 1], n_samples),
    'Loan_Status': np.random.choice(['Y', 'N'], n_samples)
})
print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (500, 7)

First 5 rows:
  Gender Married     Education  ApplicantIncome  LoanAmount  Credit_History Loan_Status
0   Male     Yes      Graduate             8623         169               1           N
1   Male      No  Not Graduate             3586         350               0           Y
2   Male     Yes      Graduate             6040         496               1           Y
3   Male     Yes      Graduate             3024         298               1           N
4   Male     Yes      Graduate             5649         137               1           Y
Data Preprocessing
We need to convert the categorical variables to numerical format and then separate the features from the target variable:
# Encode categorical variables
le = LabelEncoder()
categorical_columns = ['Gender', 'Married', 'Education']
for col in categorical_columns:
    data[col] = le.fit_transform(data[col])
# Prepare features (X) and target (y)
X = data.drop('Loan_Status', axis=1)
y = le.fit_transform(data['Loan_Status'])
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nProcessed features:")
print(X.head())
Features shape: (500, 6)
Target shape: (500,)

Processed features:
   Gender  Married  Education  ApplicantIncome  LoanAmount  Credit_History
0       1        1          0             8623         169               1
1       1        0          1             3586         350               0
2       1        1          0             6040         496               1
3       1        1          0             3024         298               1
4       1        1          0             5649         137               1
Machine Learning Models Implementation
We will implement three different machine learning models and compare their performance.
Logistic Regression
Logistic Regression is a statistical method for binary classification problems. It uses the logistic function to model the probability of a particular class.
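The logistic function at the heart of the model can be sketched directly; this small example (not part of the loan pipeline) shows how it squashes any real-valued score into a probability:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into the (0, 1) range,
    # which the model interprets as the probability of the positive class
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))  # 0.5
# Large positive scores approach 1, large negative scores approach 0
print(round(sigmoid(6.0), 3))
print(round(sigmoid(-6.0), 3))
```

Logistic Regression learns a weighted sum of the input features and passes it through this function to produce the approval probability.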
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
# Make predictions
lr_pred = lr_model.predict(X_test)
# Evaluate the model
lr_accuracy = accuracy_score(y_test, lr_pred)
print("Logistic Regression Accuracy:", lr_accuracy)
Logistic Regression Accuracy: 0.5
Decision Tree
A Decision Tree represents features as internal nodes, decision rules as branches, and outcomes as leaf nodes, resembling a flowchart structure.
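The flowchart structure can be made visible with scikit-learn's export_text. This is a minimal sketch on hypothetical toy data (a single income feature with a made-up approval threshold), separate from the dataset built above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: loans approved (1) for higher incomes, rejected (0) otherwise
X = np.array([[2000], [3000], [5000], [8000]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# export_text prints the learned decision rules as a flowchart-like outline
print(export_text(tree, feature_names=['ApplicantIncome']))
```

The printed rules show the internal node (an income threshold), the two branches, and the leaf outcomes, matching the description above.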
from sklearn.tree import DecisionTreeClassifier
# Create and train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
# Make predictions
dt_pred = dt_model.predict(X_test)
# Evaluate the model
dt_accuracy = accuracy_score(y_test, dt_pred)
print("Decision Tree Accuracy:", dt_accuracy)
Decision Tree Accuracy: 0.48
Random Forest
Random Forest builds many decision trees during training and outputs the class chosen by the majority of those trees (scikit-learn implements this by averaging the trees' predicted class probabilities).
from sklearn.ensemble import RandomForestClassifier
# Create and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions
rf_pred = rf_model.predict(X_test)
# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)
Random Forest Accuracy: 0.52
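The "mode of the trees" idea can be seen directly by inspecting a small forest's individual estimators. This self-contained sketch uses synthetic data from make_classification as a stand-in for the loan dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data stands in for the loan dataset here
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# A deliberately small forest so the individual votes are easy to inspect
rf = RandomForestClassifier(n_estimators=5, random_state=42).fit(X, y)

sample = X[:1]
votes = [int(t.predict(sample)[0]) for t in rf.estimators_]
print("Individual tree votes:", votes)
print("Forest prediction:", int(rf.predict(sample)[0]))
```

The forest's prediction agrees with the majority vote of its five trees, which is the ensemble behavior described above.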
Model Comparison
Let's compare the performance of all three models:
# Compare model accuracies
models = ['Logistic Regression', 'Decision Tree', 'Random Forest']
accuracies = [lr_accuracy, dt_accuracy, rf_accuracy]
# Display results
results_df = pd.DataFrame({
    'Model': models,
    'Accuracy': accuracies
})
print("Model Performance Comparison:")
print(results_df)
print(f"\nBest performing model: {models[np.argmax(accuracies)]} with {max(accuracies):.2f} accuracy")
Model Performance Comparison:
                 Model  Accuracy
0  Logistic Regression      0.50
1        Decision Tree      0.48
2        Random Forest      0.52
Best performing model: Random Forest with 0.52 accuracy
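A bar chart makes the comparison easier to scan. This sketch hardcodes the accuracy values from the runs above and uses the non-interactive Agg backend so it runs in any environment; the output filename is our own choice:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Accuracy values taken from the runs above
models = ['Logistic Regression', 'Decision Tree', 'Random Forest']
accuracies = [0.50, 0.48, 0.52]

plt.figure(figsize=(8, 4))
plt.bar(models, accuracies, color=['steelblue', 'darkorange', 'seagreen'])
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.title('Model Performance Comparison')
plt.tight_layout()
plt.savefig('model_comparison.png')  # writes the chart to a PNG file
```

On real data the gap between models is usually wider; here all three sit near 50% because the sample labels are random.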
Feature Importance
Random Forest provides feature importance scores, helping us understand which features contribute most to loan approval decisions:
# Get feature importance from Random Forest
feature_names = X.columns
importance_scores = rf_model.feature_importances_
# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_scores
}).sort_values('Importance', ascending=False)
print("Feature Importance (Random Forest):")
print(importance_df)
Feature Importance (Random Forest):
           Feature  Importance
3  ApplicantIncome    0.205073
4       LoanAmount    0.198045
5   Credit_History    0.186875
0           Gender    0.152893
2        Education    0.129490
1          Married    0.127623
Conclusion
We implemented three machine learning models for loan eligibility prediction in Python. Because the sample labels here are randomly generated, all three models hover around chance level, with Random Forest highest at 52% accuracy, and applicant income and loan amount received the largest importance scores. On a real loan dataset, proper data preprocessing, feature engineering, and hyperparameter tuning would significantly improve performance.
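The hyperparameter tuning mentioned above can be sketched with GridSearchCV. This example uses synthetic data as a stand-in for a real loan dataset, and the parameter grid is an illustrative choice, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real loan dataset in this sketch
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Search a small, illustrative hyperparameter grid with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

Cross-validated scores like these give a more reliable estimate of generalization than the single train/test split used in the article.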
