- Scikit Learn Tutorial
- Scikit Learn - Home
- Scikit Learn - Introduction
- Scikit Learn - Modelling Process
- Scikit Learn - Data Representation
- Scikit Learn - Estimator API
- Scikit Learn - Conventions
- Scikit Learn - Linear Modeling
- Scikit Learn - Extended Linear Modeling
- Stochastic Gradient Descent
- Scikit Learn - Support Vector Machines
- Scikit Learn - Anomaly Detection
- Scikit Learn - K-Nearest Neighbors
- Scikit Learn - KNN Learning
- Classification with Naïve Bayes
- Scikit Learn - Decision Trees
- Randomized Decision Trees
- Scikit Learn - Boosting Methods
- Scikit Learn - Clustering Methods
- Clustering Performance Evaluation
- Dimensionality Reduction using PCA
- Scikit Learn Useful Resources
- Scikit Learn - Quick Guide
- Scikit Learn - Useful Resources
- Scikit Learn - Discussion
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Scikit Learn - Modelling Process
This chapter deals with the modelling process involved in Sklearn. Let us understand about the same in detail and begin with dataset loading.
A collection of data is called dataset. It is having the following two components −
Features − The variables of data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix − It is the collection of features, in case there are more than one.
Feature Names − It is the list of all the names of the features.
Response − It is the output variable that basically depends upon the feature variables. They are also known as target, label or output.
Response Vector − It is used to represent response column. Generally, we have just one response column.
Target Names − It represent the possible values taken by a response vector.
Scikit-learn have few example datasets like iris and digits for classification and the Boston house prices for regression.
Following is an example to load iris dataset −
from sklearn.datasets import load_iris iris = load_iris() X = iris.data y = iris.target feature_names = iris.feature_names target_names = iris.target_names print("Feature names:", feature_names) print("Target names:", target_names) print("\nFirst 10 rows of X:\n", X[:10])
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] Target names: ['setosa' 'versicolor' 'virginica'] First 10 rows of X: [ [5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2] [5.4 3.9 1.7 0.4] [4.6 3.4 1.4 0.3] [5. 3.4 1.5 0.2] [4.4 2.9 1.4 0.2] [4.9 3.1 1.5 0.1] ]
Splitting the dataset
To check the accuracy of our model, we can split the dataset into two pieces-a training set and a testing set. Use the training set to train the model and testing set to test the model. After that, we can evaluate how well our model did.
The following example will split the data into 70:30 ratio, i.e. 70% data will be used as training data and 30% will be used as testing data. The dataset is iris dataset as in above example.
from sklearn.datasets import load_iris iris = load_iris() X = iris.data y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.3, random_state = 1 ) print(X_train.shape) print(X_test.shape) print(y_train.shape) print(y_test.shape)
(105, 4) (45, 4) (105,) (45,)
As seen in the example above, it uses train_test_split() function of scikit-learn to split the dataset. This function has the following arguments −
X, y − Here, X is the feature matrix and y is the response vector, which need to be split.
test_size − This represents the ratio of test data to the total given data. As in the above example, we are setting test_data = 0.3 for 150 rows of X. It will produce test data of 150*0.3 = 45 rows.
random_size − It is used to guarantee that the split will always be the same. This is useful in the situations where you want reproducible results.
Train the Model
Next, we can use our dataset to train some prediction-model. As discussed, scikit-learn has wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, predicting accuracy, recall etc.
In the example below, we are going to use KNN (K nearest neighbors) classifier. Don’t go into the details of KNN algorithms, as there will be a separate chapter for that. This example is used to make you understand the implementation part only.
from sklearn.datasets import load_iris iris = load_iris() X = iris.data y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.4, random_state=1 ) from sklearn.neighbors import KNeighborsClassifier from sklearn import metrics classifier_knn = KNeighborsClassifier(n_neighbors = 3) classifier_knn.fit(X_train, y_train) y_pred = classifier_knn.predict(X_test) # Finding accuracy by comparing actual response values(y_test)with predicted response value(y_pred) print("Accuracy:", metrics.accuracy_score(y_test, y_pred)) # Providing sample data and the model will make prediction out of that data sample = [[5, 5, 3, 2], [2, 4, 3, 5]] preds = classifier_knn.predict(sample) pred_species = [iris.target_names[p] for p in preds] print("Predictions:", pred_species)
Accuracy: 0.9833333333333333 Predictions: ['versicolor', 'virginica']
Once you train the model, it is desirable that the model should be persist for future use so that we do not need to retrain it again and again. It can be done with the help of dump and load features of joblib package.
Consider the example below in which we will be saving the above trained model (classifier_knn) for future use −
from sklearn.externals import joblib joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')
The above code will save the model into file named iris_classifier_knn.joblib. Now, the object can be reloaded from the file with the help of following code −
Preprocessing the Data
As we are dealing with lots of data and that data is in raw form, before inputting that data to machine learning algorithms, we need to convert it into meaningful data. This process is called preprocessing the data. Scikit-learn has package named preprocessing for this purpose. The preprocessing package has the following techniques −
This preprocessing technique is used when we need to convert our numerical values into Boolean values.
import numpy as np from sklearn import preprocessing Input_data = np.array( [2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]] ) data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data) print("\nBinarized data:\n", data_binarized)
In the above example, we used threshold value = 0.5 and that is why, all the values above 0.5 would be converted to 1, and all the values below 0.5 would be converted to 0.
Binarized data: [ [ 1. 0. 1.] [ 0. 1. 1.] [ 0. 0. 1.] [ 1. 1. 0.] ]
This technique is used to eliminate the mean from feature vector so that every feature centered on zero.
import numpy as np from sklearn import preprocessing Input_data = np.array( [2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]] ) #displaying the mean and the standard deviation of the input data print("Mean =", input_data.mean(axis=0)) print("Stddeviation = ", input_data.std(axis=0)) #Removing the mean and the standard deviation of the input data data_scaled = preprocessing.scale(input_data) print("Mean_removed =", data_scaled.mean(axis=0)) print("Stddeviation_removed =", data_scaled.std(axis=0))
Mean = [ 1.75 -1.275 2.2 ] Stddeviation = [ 2.71431391 4.20022321 4.69414529] Mean_removed = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00] Stddeviation_removed = [ 1. 1. 1.]
We use this preprocessing technique for scaling the feature vectors. Scaling of feature vectors is important, because the features should not be synthetically large or small.
import numpy as np from sklearn import preprocessing Input_data = np.array( [ [2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8] ] ) data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1)) data_scaled_minmax = data_scaler_minmax.fit_transform(input_data) print ("\nMin max scaled data:\n", data_scaled_minmax)
Min max scaled data: [ [ 0.48648649 0.58252427 0.99122807] [ 0. 1. 0.81578947] [ 0.27027027 0. 1. ] [ 1. 0.99029126 0. ] ]
We use this preprocessing technique for modifying the feature vectors. Normalisation of feature vectors is necessary so that the feature vectors can be measured at common scale. There are two types of normalisation as follows −
It is also called Least Absolute Deviations. It modifies the value in such a manner that the sum of the absolute values remains always up to 1 in each row. Following example shows the implementation of L1 normalisation on input data.
import numpy as np from sklearn import preprocessing Input_data = np.array( [ [2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8] ] ) data_normalized_l1 = preprocessing.normalize(input_data, norm='l1') print("\nL1 normalized data:\n", data_normalized_l1)
L1 normalized data: [ [ 0.22105263 -0.2 0.57894737] [-0.2027027 0.32432432 0.47297297] [ 0.03571429 -0.56428571 0.4 ] [ 0.42142857 0.16428571 -0.41428571] ]
Also called Least Squares. It modifies the value in such a manner that the sum of the squares remains always up to 1 in each row. Following example shows the implementation of L2 normalisation on input data.
import numpy as np from sklearn import preprocessing Input_data = np.array( [ [2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8] ] ) data_normalized_l2 = preprocessing.normalize(input_data, norm='l2') print("\nL1 normalized data:\n", data_normalized_l2)
L2 normalized data: [ [ 0.33946114 -0.30713151 0.88906489] [-0.33325106 0.53320169 0.7775858 ] [ 0.05156558 -0.81473612 0.57753446] [ 0.68706914 0.26784051 -0.6754239 ] ]