7 Easy Steps to Build a Machine Learning Classifier with Code in Python

Machine Learning (ML), a branch of Artificial Intelligence (AI), is more than a buzzword. It continues to grow and is widely expected to be among the most transformative technologies of the next decade. Applications of machine learning that are already having an impact on society include self-driving vehicles, fraud detection systems, and tumor detection.

Machine learning has also become part of our daily routines. From voice-enabled personal assistants (PAs) such as Siri, Alexa, and Google Assistant, to personalized music, movie, news, and shopping recommendations, to search suggestions, much of what we use is directly or indirectly influenced by machine learning.

If you want to know more about the fundamentals of machine learning, do take a look at the machine learning tutorial.

Classification, a supervised machine learning task, is the process of assigning each input instance to one of a fixed set of classes based on its features. Mathematically, classification is the task of approximating a mapping function (f) from the input variables to discrete output variables (the class labels). One of the most basic examples of ML classification is an email spam filter, which classifies each email as either "spam" or "not spam".
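
To make the idea of a mapping function concrete, here is a minimal sketch of a hand-written f for the spam example. The function name classify_email and the keyword list are hypothetical and chosen only for illustration; a real classifier learns this mapping from labeled data rather than hard-coding it.

```python
# Hypothetical keyword list, for illustration only
SPAM_KEYWORDS = {"winner", "free", "prize"}

def classify_email(text):
    """Map an input (email text) to one of two class labels."""
    words = set(text.lower().split())
    # Any overlap with the keyword set is treated as spam
    return "spam" if words & SPAM_KEYWORDS else "not spam"

label_1 = classify_email("You are a winner claim your free prize")
label_2 = classify_email("Meeting moved to Monday at ten")
print(label_1)
print(label_2)
```

The rest of this article replaces such a fixed rule with a model that is fitted to example data.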

In this article, we’ll build a machine learning classifier in Python in 7 easy steps, using the Breast Cancer Wisconsin (Diagnostic) Dataset. We will use a Naïve Bayes (NB) classifier to predict whether a breast tumor is malignant or benign.

Curious to build the classifier? Let’s get started.

Step 1: Importing ML Library for Python

To start creating an ML classifier in Python, we need an ML library. Here, we will use Scikit-learn, one of the most widely used open-source ML libraries for Python. Use the command below to import it −

import sklearn

If Scikit-learn is installed on your computer, the above command will complete without any error. If it is not installed, you will get an error message similar to the one given below (recent Python versions report it as ModuleNotFoundError, a subclass of ImportError) −

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named 'sklearn'

You can install Scikit-learn using the pip command as follows −

pip install -U scikit-learn
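
If you would rather check for the library without triggering an error, a small sketch like the following can help. It uses only the standard library's importlib to look for the package; the variable name is_installed is just an illustrative choice.

```python
import importlib.util

# find_spec() returns None when the package cannot be found
is_installed = importlib.util.find_spec("sklearn") is not None

if is_installed:
    import sklearn
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn is not installed; run: pip install -U scikit-learn")
```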

Step 2: Importing the Dataset

To build our classifier, we will use the "Breast Cancer Wisconsin (Diagnostic) Dataset" bundled with Scikit-learn, which is widely used for classification examples. It contains 569 instances with 30 numeric, predictive attributes such as radius, texture, perimeter, area, smoothness, and symmetry. It also contains two classification labels, namely malignant and benign.

Using this dataset, our classifier will predict whether a breast cancer tumor is malignant or benign.

Let’s import and load this dataset −

# Import dataset
from sklearn.datasets import load_breast_cancer
# Load dataset
data_BreastCancer = load_breast_cancer()

The data_BreastCancer variable, which we created above, is a Bunch object that works like a dictionary. The four important keys to consider are −

The classification label names (target_names)
The actual labels (target)
The feature names (feature_names)
The attributes (data)

We now organize our data by creating a new variable for each of these keys and assigning the corresponding data −

# Organizing the data
label_names = data_BreastCancer['target_names']
labels = data_BreastCancer['target']
feature_names = data_BreastCancer['feature_names']
features = data_BreastCancer['data']
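
Before printing the raw values, it can be useful to confirm the size of the dataset and how many instances belong to each class. The following is a small sketch that reloads the dataset so it runs on its own; class_counts is an illustrative variable name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data_BreastCancer = load_breast_cancer()
features = data_BreastCancer['data']
labels = data_BreastCancer['target']
label_names = data_BreastCancer['target_names']

# Number of instances and number of attributes per instance
print('Features shape:', features.shape)
print('Labels shape:', labels.shape)

# Count how many instances carry each label (index 0, then index 1)
class_counts = np.bincount(labels)
for name, count in zip(label_names, class_counts):
    print(name, count)
```

The shapes should agree with the 569 instances and 30 attributes described above.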

To get a better understanding of the dataset, let’s print the class labels, the labels of the first fifty data instances, the feature names, and the feature values of the first data instance −

Example

# Look at the data
print('\nClass Labels:',label_names)
print('\nFirst Fifty Data Instance Labels:',labels[:50])
print('\nFeature Names:',feature_names)
print('\nFeature Values for First Data Instance:',features[0])

Output

You will get the following output if you run the code −

Class Labels: ['malignant' 'benign']

First Fifty Data Instance Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1]

Feature Names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Feature Values for First Data Instance: [1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]

As you can see in the output above, our class names are malignant, which is mapped to the binary value 0, and benign, which is mapped to the binary value 1. The output also shows that our first data instance is a malignant tumor whose mean radius is 1.799e+01 (that is, 17.99).

Now that we have our data loaded, we can put it to work to build our ML classifier. Let’s see how in the next steps.
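
The numeric labels and unnamed feature values are easier to read when paired with their names. Here is a minimal sketch of that lookup for the first data instance; first_label_name and first_instance are illustrative names, and the dataset is reloaded so the snippet is self-contained.

```python
from sklearn.datasets import load_breast_cancer

data_BreastCancer = load_breast_cancer()
label_names = data_BreastCancer['target_names']
labels = data_BreastCancer['target']
feature_names = data_BreastCancer['feature_names']
features = data_BreastCancer['data']

# Use the numeric label (0 or 1) as an index into the class names
first_label_name = label_names[labels[0]]

# Pair every feature name with its value for the first instance
first_instance = dict(zip(feature_names, features[0]))

print('First instance class:', first_label_name)
print('Mean radius:', first_instance['mean radius'])
```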

Step 3: Organizing the Data into Training and Testing Sets

To evaluate the accuracy of an ML model, we should test it on data it has not seen during training. That is why we first split our data into two parts, namely a training set and a test set.

The Scikit-learn library has a train_test_split() function with which we can divide our data into these sets −

from sklearn.model_selection import train_test_split
# Split the data into training and test set
train, test, train_labels, test_labels = train_test_split(features,labels,test_size=0.40,random_state=42)

In this example, the training set holds 60% of the original dataset and the test set holds the remaining 40%. Setting random_state fixes the shuffling, so the same split is produced on every run.
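
To verify the split, you can inspect the shapes of the resulting arrays. The sketch below also shows an optional variation that this article does not use: passing stratify=labels asks train_test_split() to keep the class proportions similar in both sets. The s_ prefixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, labels = load_breast_cancer(return_X_y=True)

# Same split as in the article: 60% training, 40% test
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)
print('Training set shape:', train.shape)
print('Test set shape:', test.shape)

# Optional variation (not used in this article): stratified split
s_train, s_test, s_train_labels, s_test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42, stratify=labels)
print('Share of benign cases overall:', labels.mean())
print('Share of benign cases in stratified test set:', s_test_labels.mean())
```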

Now, it’s time to build and train the classifier model.

Step 4: Building the Model

For building our ML classifier, we will use a simple algorithm named Naïve Bayes (NB), which often serves as a fast and reasonably accurate baseline for binary classification tasks. Scikit-learn provides several Naïve Bayes models, including the following three −

Gaussian Naïve Bayes − It assumes that, within each class, every continuous feature follows a Gaussian (normal) distribution characterized by its mean and variance.
Multinomial Naïve Bayes − It assumes a feature vector in which each element is a count, such as the number of times a word appears in a document.
Bernoulli Naïve Bayes − It assumes binary features. It is useful when we need to check whether a feature is present or not.
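
To give a feel for the Gaussian variant, here is a rough sketch of its decision rule on a toy one-feature problem, written with NumPy only. The data values and the helper names gaussian_log_pdf and predict are made up for illustration; this is not how Scikit-learn implements the model internally, only the same idea in miniature: score each class by its log prior plus the Gaussian log-likelihood, and pick the highest score.

```python
import numpy as np

# Toy 1-D training data per class (hypothetical values)
x_class0 = np.array([1.0, 1.2, 0.8, 1.1])
x_class1 = np.array([3.0, 3.2, 2.9, 3.1])
n_total = len(x_class0) + len(x_class1)

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(x):
    """Return the index of the class with the highest posterior score."""
    scores = []
    for samples in (x_class0, x_class1):
        log_prior = np.log(len(samples) / n_total)
        log_likelihood = gaussian_log_pdf(x, samples.mean(), samples.var())
        scores.append(log_prior + log_likelihood)
    return int(np.argmax(scores))

pred_low = predict(1.05)   # value close to the class 0 samples
pred_high = predict(2.8)   # value close to the class 1 samples
print(pred_low, pred_high)
```

With several features, the "naïve" assumption is that the per-feature log-likelihoods can simply be added together for each class.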

Since our features are continuous, we will use the Gaussian Naïve Bayes model. We first import the GaussianNB class from Sklearn and then initialize the model by calling GaussianNB() −

# Import GaussianNB class
from sklearn.naive_bayes import GaussianNB
# Initializing our classifier model
Gaussian_NB = GaussianNB()

Step 5: Training the Model

We now train our classifier by fitting it to the training data. The fit() method returns the fitted estimator itself, so NB_Clf and Gaussian_NB refer to the same trained model −

# Train the classifier
NB_Clf = Gaussian_NB.fit(train,train_labels)
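
After fitting, the model stores what it learned in attributes whose names end with an underscore. The following sketch prints a few of them; it repeats the loading and splitting steps so that it runs on its own, and the variable name model is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features, labels = load_breast_cancer(return_X_y=True)
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)

model = GaussianNB().fit(train, train_labels)

# Class labels seen during training
print('Classes:', model.classes_)
# Fraction of training instances in each class
print('Class priors:', model.class_prior_)
# Per-class mean of every feature: one row per class, one column per feature
print('Shape of feature means:', model.theta_.shape)
```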

Step 6: Making Predictions on Test Set

Once we have trained the classifier model, we can use it to make predictions on our test set. We use the predict() method, which returns an array containing one prediction for each data instance in the test set.

Let’s use the predict() function and print the predictions −

Example

# Making predictions on the test set
Preds_NBClf = NB_Clf.predict(test)
# Print the predictions
print(Preds_NBClf)

Output

You will get the following output if you run the code −

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
 0 0 1 1 0 1]

The above output, an array of 0s and 1s, represents the predicted tumor class for each test instance, i.e., malignant (0) or benign (1).
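
Besides hard 0/1 predictions, the classifier can also report how confident it is. The sketch below uses predict_proba(), which returns one probability per class for each instance; it rebuilds the model so the snippet is self-contained, and the names model and probs are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features, labels = load_breast_cancer(return_X_y=True)
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)
model = GaussianNB().fit(train, train_labels)

# Column 0 is the probability of class 0 (malignant), column 1 of class 1 (benign)
probs = model.predict_proba(test[:3])
print(np.round(probs, 3))

# predict() picks the class with the highest probability
print(model.predict(test[:3]))
```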

Step 7: Evaluating the Classifier Accuracy

In this step, we evaluate the accuracy of our ML classifier by comparing two arrays, test_labels and Preds_NBClf. For computing the accuracy, the Sklearn library provides a function named accuracy_score().

Example

# Import accuracy_score function
from sklearn.metrics import accuracy_score
# Evaluating the accuracy of our classifier
print('Accuracy:',accuracy_score(test_labels, Preds_NBClf))

Output

You will get the following output −

Accuracy: 0.9517543859649122

The above output shows that our Naïve Bayes classifier is 95.18% accurate, which means that it predicts the correct tumor class for 95.18% of the instances in the test set.
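
Accuracy is simply the fraction of predictions that match the true labels, so it can also be computed by hand. The sketch below does that and adds a confusion matrix, which breaks the result down by class (rows are true classes, columns are predicted classes). It repeats the earlier steps so it runs on its own; manual_accuracy and cm are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features, labels = load_breast_cancer(return_X_y=True)
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of test instances where prediction and true label agree
manual_accuracy = np.mean(preds == test_labels)
library_accuracy = accuracy_score(test_labels, preds)
print('Manual accuracy:', manual_accuracy)
print('accuracy_score :', library_accuracy)

# Row 0: true malignant, row 1: true benign
cm = confusion_matrix(test_labels, preds)
print(cm)
```

For a medical task like this one, the off-diagonal cells matter: a malignant tumor predicted as benign is usually a more serious mistake than the reverse.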

Conclusion

In this article, you learned how to build a machine learning classifier in Python in 7 easy steps. With these steps, you can now load a dataset, organize the data, train an ML model, make predictions on the test set, and evaluate the accuracy of the classifier.

You can now experiment with different subsets of features, or try the other two Naïve Bayes models, i.e., Multinomial and Bernoulli, as well as various other machine learning algorithms.
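
As a starting point for such experiments, here is one possible sketch that compares all 30 features against only the first ten columns (the 'mean ...' features) using 5-fold cross-validation. The choice of subset and of cv=5 is arbitrary and only meant as an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

features, labels = load_breast_cancer(return_X_y=True)

# The first ten columns hold the 'mean ...' features
mean_features = features[:, :10]

# Each call returns one accuracy value per fold
scores_all = cross_val_score(GaussianNB(), features, labels, cv=5)
scores_mean = cross_val_score(GaussianNB(), mean_features, labels, cv=5)

print('All 30 features :', scores_all.mean())
print('Mean features   :', scores_mean.mean())
```

Cross-validation averages over several train/test splits, so it gives a steadier comparison than a single split.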

Gaurav Leekha

Updated on: 21-Aug-2023
