Learning Model Building in Scikit-learn: A Python Machine Learning Library
Scikit-learn is a powerful and user-friendly machine learning library for Python. It provides simple and efficient tools for data mining, data analysis, and building machine learning models with support for algorithms like random forest, support vector machines, and k-nearest neighbors.
Installing Required Libraries
Before building models, ensure you have the necessary libraries installed:
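If any of these are missing, they can be installed from PyPI; the package names below are the standard PyPI names for the libraries used in this article:

```shell
# Install scikit-learn and the supporting libraries used below
pip install scikit-learn pandas numpy seaborn matplotlib
```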
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
Loading and Exploring the Dataset
We'll use the famous Iris dataset to demonstrate model building:
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target
print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (150, 5)

First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5               1.4               0.2        0
1                4.9               3.0               1.4               0.2        0
2                4.7               3.2               1.3               0.2        0
3                4.6               3.1               1.5               0.2        0
4                5.0               3.6               1.4               0.2        0
Data Exploration
Let's examine the dataset structure and check for missing values:
# Basic information about the dataset
print("Dataset info:")
print(data.info())
print("\nMissing values:")
print(data.isnull().sum())
print("\nTarget classes:")
print(data['species'].value_counts())
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int32
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
None

Missing values:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64

Target classes:
species
0    50
1    50
2    50
Name: count, dtype: int64
Data Preprocessing
Prepare the data for model training by separating features and target variable:
# Separate features and target
X = data.drop('species', axis=1)
y = data['species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
Training set shape: (105, 4)
Testing set shape: (45, 4)
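Because the Iris classes are perfectly balanced (50 samples each), a plain random split works well here. On imbalanced data it is safer to preserve class proportions with the stratify argument; a minimal sketch:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# stratify=y keeps each class's share identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_train.value_counts())  # 35 samples per class
print(y_test.value_counts())   # 15 samples per class
```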
Model Building and Training
Create and train a Random Forest classifier:
# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
print("Model trained successfully!")
print("Predictions for first 10 test samples:", y_pred[:10])
Model trained successfully!
Predictions for first 10 test samples: [1 0 2 1 1 0 1 2 1 1]
Model Evaluation
Evaluate the model's performance using accuracy score:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
Model Accuracy: 1.00
Feature Importance:
feature importance
2 petal length (cm) 0.458159
3 petal width (cm) 0.413859
0 sepal length (cm) 0.103138
1 sepal width (cm) 0.024844
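An accuracy of 1.00 on a 45-sample test set can be optimistic. K-fold cross-validation from the same model_selection module gives a more robust estimate; a sketch using 5 folds:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train and evaluate on 5 different train/test partitions of the data
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The mean across folds is typically a little below the single-split result, which is a more realistic picture of generalization.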
Key Steps Summary
| Step | Purpose | Scikit-learn Module |
|---|---|---|
| Data Loading | Import dataset | datasets |
| Data Splitting | Train/test separation | model_selection |
| Model Training | Algorithm training | ensemble, svm, etc. |
| Model Evaluation | Performance assessment | metrics |
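Beyond a single accuracy number, the metrics module in the table also provides per-class precision, recall, and F1 scores. A sketch continuing the same workflow:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 3x3 matrix: one row per true class, one column per predicted class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```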
Conclusion
Scikit-learn provides a streamlined workflow for machine learning model building, from data preprocessing to model evaluation. Its consistent estimator API (fit, predict, score) makes it easy to swap in different algorithms and compare their performance on the same dataset with minimal code changes.
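To illustrate that consistency, swapping the random forest for a support vector machine changes only the estimator line; the fit and score calls stay the same. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Same fit/predict/score interface as RandomForestClassifier
model = SVC(kernel='rbf', random_state=42)
model.fit(X_train, y_train)
print("SVC test accuracy: %.2f" % model.score(X_test, y_test))
```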
