7 Easy Steps to Build a Machine Learning Classifier with Codes in Python
Machine Learning (ML), a branch of Artificial Intelligence (AI), is more than a buzzword right now. It is growing rapidly and is expected to be among the most transformative technologies of the next decade. A few applications of machine learning that have already started having an impact on society include self-driving vehicles, fraud detection systems, and tumor detection.
Classification, a supervised machine learning task, is the process of assigning a class label to an input based on its features - in other words, deciding which of a set of predefined categories a new observation belongs to. One of the most basic examples of ML classification is an email spam filtration system, which classifies each email as either "spam" or "not spam".
In this article, with 7 easy steps, we'll build a machine learning classifier in Python using the Breast Cancer Wisconsin Diagnostic Dataset. We will be using a Naïve Bayes (NB) classifier that predicts whether a breast cancer tumor is malignant or benign.
Step 1: Importing ML Library for Python
To start creating an ML classifier in Python, we need an ML library. Here, we will be using Scikit-learn, one of the best open-source ML libraries for Python. Use the following commands to import it and verify the installed version -
import sklearn
print("Scikit-learn version:", sklearn.__version__)
Scikit-learn version: 1.3.0
If you don't have Scikit-learn installed, you can download it using the pip command -
pip install -U scikit-learn
Step 2: Importing the Dataset
To build our classifier, we will use Sklearn's Breast Cancer Wisconsin Dataset, which is widely used for classification purposes. It contains 569 instances with 30 numeric, predictive attributes such as the radius of the tumor, texture, perimeter, area, symmetry, smoothness, etc. It also contains two classification labels, namely malignant and benign.
Let's import and load this dataset -
# Import dataset
from sklearn.datasets import load_breast_cancer
# Load dataset
data_BreastCancer = load_breast_cancer()
# Organizing the data
label_names = data_BreastCancer['target_names']
labels = data_BreastCancer['target']
feature_names = data_BreastCancer['feature_names']
features = data_BreastCancer['data']
# Look at the data
print('Class Labels:', label_names)
print('\nFirst Ten Data Instance Labels:', labels[:10])
print('\nNumber of Features:', len(feature_names))
print('\nDataset Shape:', features.shape)
Class Labels: ['malignant' 'benign']
First Ten Data Instance Labels: [0 0 0 0 0 0 0 0 0 0]
Number of Features: 30
Dataset Shape: (569, 30)
As you can see in the output above, our class names are malignant (mapped to 0) and benign (mapped to 1). The dataset contains 569 instances with 30 features each.
Step 3: Organizing the Data into Training and Testing Sets
To evaluate the accuracy of an ML model, it is always recommended to test the model on unseen data. We will split our data into training and test sets using Scikit-learn's train_test_split() function -
from sklearn.model_selection import train_test_split
# Split the data into training and test set
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)
print(f"Training set size: {len(train)} samples")
print(f"Test set size: {len(test)} samples")
Training set size: 341 samples
Test set size: 228 samples
In this example, we now have a training set that represents 60% of the original dataset and a test set that represents 40% of the original dataset.
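As an optional refinement (not part of the original steps), the split can also be stratified so that both subsets keep the original malignant/benign ratio, which matters for imbalanced datasets. A minimal self-contained sketch using the stratify parameter -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data['data'], data['target']

# stratify=labels keeps the class proportions identical
# in both the training and the test subsets.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42, stratify=labels
)
print("Training set size:", len(train))
print("Test set size:", len(test))
```

The subset sizes are the same as with a plain split; only the class balance within each subset changes.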
Step 4: Build the Model
For building our ML classifier, we will use a simple algorithm called Naïve Bayes (NB), which performs well in binary classification tasks. Scikit-learn provides three commonly used Naïve Bayes models -
Gaussian Naïve Bayes - It assumes continuous features that follow a normal distribution, characterized by a per-class mean and variance.
Multinomial Naïve Bayes - It assumes discrete count features, such as how many times a word appears in a document.
Bernoulli Naïve Bayes - It assumes binary features, useful when we only need to know whether a feature is present or not.
Since we have continuous features, we'll use the Gaussian Naïve Bayes model -
# Import GaussianNB module
from sklearn.naive_bayes import GaussianNB
# Initialize the classifier model
Gaussian_NB = GaussianNB()
print("Model initialized successfully")
Model initialized successfully
Step 5: Train the Model
We now need to train our classifier by fitting it to the training data using the fit() function -
# Train the classifier
NB_Clf = Gaussian_NB.fit(train, train_labels)
print("Model training completed")
print("Trained on", len(train), "samples")
Model training completed
Trained on 341 samples
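As a side note, fit() returns the estimator itself (which is why the assignment above works), and score() offers a shortcut for computing accuracy without calling predict() separately. A self-contained sketch illustrating both -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

clf = GaussianNB()
# fit() returns the estimator itself, so chaining calls is possible.
fitted = clf.fit(train, train_labels)
print(fitted is clf)  # True

# score() is equivalent to predict() followed by accuracy_score().
print(f"Training accuracy: {clf.score(train, train_labels):.4f}")
```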
Step 6: Making Predictions on Test Set
Once we train the classifier model, we can use it to make predictions on our test set. The predict() function returns an array of predictions for each data instance in the test set -
# Making predictions on the test set
Preds_NBClf = NB_Clf.predict(test)
# Print first 20 predictions
print("First 20 predictions:", Preds_NBClf[:20])
print("Total predictions made:", len(Preds_NBClf))
First 20 predictions: [1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0]
Total predictions made: 228
The above output shows an array of 0s and 1s, representing the predicted tumor classes (0 for malignant, 1 for benign).
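To see how confident the classifier is in each prediction, predict_proba() can be used alongside predict(). The following minimal sketch rebuilds the same split and model so it runs on its own -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
clf = GaussianNB().fit(train, train_labels)

# predict_proba() returns one probability per class for each instance;
# predict() simply picks the class with the higher probability.
proba = clf.predict_proba(test[:3])
print(proba.shape)        # (3, 2)
print(proba.sum(axis=1))  # each row sums to 1
```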
Step 7: Evaluating the Classifier Accuracy
In this step, we evaluate the accuracy of our ML classifier by comparing the test labels with our predictions. Sklearn provides the accuracy_score() function for this purpose -
# Import accuracy_score module
from sklearn.metrics import accuracy_score, classification_report
# Evaluating the accuracy of our classifier
accuracy = accuracy_score(test_labels, Preds_NBClf)
print(f"Accuracy: {accuracy:.4f}")
print(f"Accuracy percentage: {accuracy * 100:.2f}%")
# Additional metrics
print("\nDetailed Classification Report:")
print(classification_report(test_labels, Preds_NBClf, target_names=label_names))
Accuracy: 0.9518
Accuracy percentage: 95.18%
Detailed Classification Report:
precision recall f1-score support
malignant 0.94 0.85 0.89 85
benign 0.95 0.99 0.97 143
accuracy 0.95 228
macro avg 0.95 0.92 0.93 228
weighted avg 0.95 0.95 0.95 228
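Beyond accuracy, a confusion matrix shows exactly where the misclassifications occur, which is especially relevant in a medical setting where false negatives are costly. A self-contained sketch using Sklearn's confusion_matrix() on the same split and model -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Rows are true classes (malignant, benign), columns are predictions;
# the off-diagonal entries are the misclassified instances.
cm = confusion_matrix(test_labels, preds)
print(cm)
```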
Complete Code Example
Here's the complete code that builds a machine learning classifier in 7 steps -
# Complete Machine Learning Classifier Example
# Step 1: Import libraries
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Step 2: Load and organize data
data_BreastCancer = load_breast_cancer()
features = data_BreastCancer['data']
labels = data_BreastCancer['target']
# Step 3: Split into training and testing sets
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)
# Step 4: Build the model
Gaussian_NB = GaussianNB()
# Step 5: Train the model
NB_Clf = Gaussian_NB.fit(train, train_labels)
# Step 6: Make predictions
Preds_NBClf = NB_Clf.predict(test)
# Step 7: Evaluate accuracy
accuracy = accuracy_score(test_labels, Preds_NBClf)
print(f"Final Model Accuracy: {accuracy * 100:.2f}%")
Final Model Accuracy: 95.18%
Conclusion
In this article, you learned how to build a machine learning classifier in Python in 7 easy steps. Our Naïve Bayes classifier achieved 95.18% accuracy in classifying breast cancer tumors as malignant or benign. You can now experiment with different algorithms or feature subsets, or try other Naïve Bayes variants such as Multinomial and Bernoulli for different types of data.
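As one possible next step, cross-validation can be used to compare several algorithms on the same data; it gives a more stable accuracy estimate than a single train/test split. The alternative models below (logistic regression and a decision tree) are illustrative picks, not part of the original tutorial -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data['data'], data['target']

# 5-fold cross-validation: each model is trained and evaluated
# five times, and the mean accuracy is reported.
for name, model in [
    ("GaussianNB", GaussianNB()),
    ("LogisticRegression", LogisticRegression(max_iter=10000)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```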
