Disease Prediction Using Machine Learning with Examples

Disease prediction is a crucial application of machine learning that can help improve healthcare by enabling early diagnosis and intervention. Machine learning algorithms can analyze patient data to identify patterns and predict the likelihood of a disease or condition. In this article, we will explore how disease prediction using machine learning works with practical examples.

Disease Prediction Workflow

Disease prediction using machine learning involves the following steps:

  • Data collection: The first step is to collect patient data, including medical history, symptoms, and diagnostic test results. This data is then compiled into a dataset.

  • Data preprocessing: The dataset is cleaned by handling missing values (removing or imputing them), dropping irrelevant fields, and transforming the rest into a numeric format that machine learning algorithms can use.

  • Feature selection: The most important features are selected from the dataset based on their relevance to the disease being predicted.

  • Model selection: A suitable machine learning model is selected based on the type of data and the disease being predicted. Common models include logistic regression, decision trees, random forests, support vector machines, and neural networks.

  • Training: The selected machine learning model is trained on the training portion of the preprocessed dataset.

  • Testing: The trained model is evaluated on a held-out test set that it did not see during training, using metrics such as accuracy.

  • Prediction: The trained model is used to predict the likelihood of a disease or condition for new patient data.
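The steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not a definitive implementation: the data is synthetic (generated with make_classification as a stand-in for real patient records), and the choices of median imputation, SelectKBest with k=4, and logistic regression are made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: synthetic stand-in for patient records (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Hold out a test set before any fitting so the evaluation stays unbiased
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

workflow = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # preprocessing
    ("scale", StandardScaler()),                    # preprocessing
    ("select", SelectKBest(f_classif, k=4)),        # feature selection
    ("model", LogisticRegression(max_iter=1000)),   # model selection
])

workflow.fit(X_train, y_train)                      # training
test_accuracy = workflow.score(X_test, y_test)      # testing
risk = workflow.predict_proba(X_test[:1])[0, 1]     # prediction for one patient

print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Predicted risk for first test patient: {risk:.3f}")
```

Wrapping the steps in a Pipeline means the imputer, scaler, and feature selector are fitted on the training data only, so no information from the test set leaks into training.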

Diabetes Prediction Using Logistic Regression

This example demonstrates diabetes prediction on a ten-row sample that uses five features from the Pima Indians Diabetes dataset. The sample is far too small for a realistic model, but it is enough to show the complete workflow:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

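# Create a small sample diabetes dataset.
# A value of 0 in BloodPressure or BMI is physiologically implausible
# and most likely marks a missing measurement.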
data = {
    'Pregnancies': [6, 1, 8, 1, 0, 5, 3, 10, 2, 8],
    'Glucose': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    'BloodPressure': [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    'Age': [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    'Outcome': [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

The output is:

Dataset shape: (10, 6)

First 5 rows:
   Pregnancies  Glucose  BloodPressure   BMI  Age  Outcome
0            6      148             72  33.6   50        1
1            1       85             66  26.6   31        0
2            8      183             64  23.3   32        1
3            1       89             66  28.1   21        0
4            0      137             40  43.1   33        1
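The sample dataset above contains a BloodPressure of 0 and a BMI of 0.0, which cannot be real measurements. The following sketch shows one common way to handle such entries during preprocessing. It assumes that a zero means "not measured", and the choice of median imputation is illustrative rather than the only option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Two columns from the sample dataset above, including the zero entries
vitals = pd.DataFrame({
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Assumption: a zero here means "not measured", so mark it as missing
vitals_nan = vitals.replace(0, np.nan)
missing_counts = vitals_nan.isna().sum()

# Fill each missing entry with the median of the observed values in its column
imputer = SimpleImputer(strategy="median")
vitals_filled = pd.DataFrame(imputer.fit_transform(vitals_nan),
                             columns=vitals.columns)

print(missing_counts)
print(vitals_filled.round(1))
```

In a full workflow the imputer should be fitted on the training split only and then applied to the test split, in the same way as the scaler.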

Now let's train the logistic regression model and make predictions:

# Prepare features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]

print("Predictions:", predictions)
print("Probabilities:", np.round(probabilities, 3))
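print("Accuracy:", accuracy_score(y_test, predictions))

The output is:

Predictions: [1 0 1]
Probabilities: [0.617 0.383 0.617]
Accuracy: 1.0

Treat these figures as illustrative and rerun the code to confirm them in your environment. The test split contains only three rows, so the accuracy is not a meaningful estimate of how the model would perform on real patients.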
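For a binary logistic regression, predict effectively applies a 0.5 cut-off to the probabilities returned by predict_proba. In screening settings a lower cut-off is sometimes preferred, so that fewer true cases are missed at the cost of more false alarms. The sketch below illustrates the idea; the probabilities, the classify helper, and the 0.35 threshold are all hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class for five patients
probabilities = np.array([0.62, 0.38, 0.45, 0.91, 0.12])

def classify(probs, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (probs >= threshold).astype(int)

default_labels = classify(probabilities)          # cut-off of 0.5
screening_labels = classify(probabilities, 0.35)  # lower cut-off flags more patients

print("Threshold 0.50:", default_labels)    # [1 0 0 1 0]
print("Threshold 0.35:", screening_labels)  # [1 1 1 1 0]
```

Lowering the threshold moves two borderline patients into the positive class. Whether that trade-off is acceptable depends on the relative cost of a missed diagnosis versus an unnecessary follow-up.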

Heart Disease Prediction Using Random Forest

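Random forests are well suited to tabular clinical data such as heart disease records: they can capture non-linear interactions among features and report how much each feature contributes to the predictions. The example below uses a small ten-row sample dataset: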

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Create heart disease dataset
heart_data = {
    'Age': [63, 37, 41, 56, 57, 44, 52, 57, 54, 48],
    'Sex': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    'ChestPain': [3, 2, 1, 1, 0, 1, 2, 2, 0, 2],
    'RestingBP': [145, 130, 130, 120, 120, 120, 172, 150, 140, 130],
    'Cholesterol': [233, 250, 204, 236, 354, 263, 199, 168, 239, 275],
    'MaxHR': [150, 187, 172, 178, 163, 173, 162, 174, 160, 154],
    'HeartDisease': [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
}

heart_df = pd.DataFrame(heart_data)

# Prepare features and target
X_heart = heart_df.drop('HeartDisease', axis=1)
y_heart = heart_df['HeartDisease']

# Split and train
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_heart, y_heart, test_size=0.3, random_state=42)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_h, y_train_h)

# Make predictions
rf_predictions = rf_model.predict(X_test_h)
rf_probabilities = rf_model.predict_proba(X_test_h)[:, 1]

print("Heart Disease Predictions:", rf_predictions)
print("Risk Probabilities:", np.round(rf_probabilities, 3))
print("Feature Importance:")
for feature, importance in zip(X_heart.columns, rf_model.feature_importances_):
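    print(f"{feature}: {importance:.3f}")

The output is:

Heart Disease Predictions: [0 1 0]
Risk Probabilities: [0.19 0.81 0.19]
Feature Importance:
Age: 0.088
Sex: 0.151
ChestPain: 0.174
RestingBP: 0.158
Cholesterol: 0.226
MaxHR: 0.203

The importances sum to 1 and rank the features by how much they contributed to the forest's splits. Because they are estimated from only seven training rows, the ranking here is unstable and illustrates the API rather than any clinical finding.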
Model Performance Comparison

Let's compare different algorithms for disease prediction:

| Algorithm           | Best For              | Advantages                     | Limitations               |
|---------------------|-----------------------|--------------------------------|---------------------------|
| Logistic Regression | Binary classification | Fast, interpretable            | Linear relationships only |
| Random Forest       | Complex patterns      | Handles non-linear data        | Less interpretable        |
| SVM                 | High-dimensional data | Works well with small datasets | Slow on large datasets    |
| Neural Networks     | Image/complex data    | Highly flexible                | Requires large datasets   |
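A qualitative comparison like this one is best backed by measurements on the data at hand. The sketch below compares the four model families with 5-fold cross-validation. It is a minimal illustration under stated assumptions: the data is synthetic (make_classification), the hyperparameters are mostly library defaults, and the resulting scores say nothing about any real disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a tabular patient dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Scale-sensitive models are wrapped in a pipeline so scaling is refit per fold
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; the mean summarizes the five folds
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

Cross-validation uses every row for both training and testing across the folds, which gives a steadier estimate than a single small test split.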

Benefits of Machine Learning in Disease Prediction

  • Early diagnosis: ML enables early detection of diseases, leading to better treatment outcomes and improved patient quality of life.

  • Personalized treatment: Algorithms can analyze patient data to identify personalized treatment options tailored to individual needs.

  • Improved healthcare efficiency: ML helps prioritize high-risk patients, leading to more efficient use of healthcare resources.

  • Cost reduction: Early prediction reduces long-term healthcare costs through preventive care.

Conclusion

Disease prediction using machine learning can support earlier diagnosis and more personalized treatment. With careful data preprocessing, appropriate model selection, and evaluation on held-out data, ML models can predict a range of diseases accurately enough to help improve patient outcomes and healthcare efficiency.

Updated on: 2026-03-27T10:37:57+05:30
