Learning Model Building in Scikit-learn: A Python Machine Learning Library

Scikit-learn is a powerful and user-friendly machine learning library for Python. It provides simple and efficient tools for data mining, data analysis, and building machine learning models with support for algorithms like random forest, support vector machines, and k-nearest neighbors.

Installing Required Libraries

Before building models, ensure you have the necessary libraries installed and imported:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns            # optional, for visualization
import matplotlib.pyplot as plt  # optional, for visualization

Loading and Exploring the Dataset

We'll use the famous Iris dataset to demonstrate model building:

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (150, 5)

First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2        0
1                4.9               3.0                1.4               0.2        0
2                4.7               3.2                1.3               0.2        0
3                4.6               3.1                1.5               0.2        0
4                5.0               3.6                1.4               0.2        0
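As an aside, recent scikit-learn versions can return the data as a pandas DataFrame directly, skipping the manual construction above. A minimal sketch using `load_iris(as_frame=True)`:

```python
from sklearn.datasets import load_iris

# as_frame=True makes the loader return pandas objects; the .frame
# attribute bundles the four feature columns plus a 'target' column
iris = load_iris(as_frame=True)
data = iris.frame

print("Dataset shape:", data.shape)
print(list(data.columns))
```

This produces the same 150 x 5 table, with the label column named `target` instead of `species`.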

Data Exploration

Let's examine the dataset structure and check for missing values:

# Basic information about the dataset
print("Dataset info:")
print(data.info())

print("\nMissing values:")
print(data.isnull().sum())

print("\nTarget classes:")
print(data['species'].value_counts())
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
None

Missing values:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64

Target classes:
species
0    50
1    50
2    50
Name: count, dtype: int64
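Beyond class counts, grouping a feature by class can hint at which measurements separate the species. A quick sketch (reloading the data so the snippet stands alone):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

# Mean petal length per class: species 0 (setosa) is markedly smaller,
# suggesting petal measurements will be strong predictors
class_means = data.groupby('species')['petal length (cm)'].mean()
print(class_means)
```

The clear separation between the class means foreshadows the feature-importance results later in this tutorial.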

Data Preprocessing

Prepare the data for model training by separating the features from the target variable:

# Separate features and target
X = data.drop('species', axis=1)
y = data['species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
Training set shape: (105, 4)
Testing set shape: (45, 4)
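The split above shuffles rows at random. When class balance matters (it barely does for Iris, which is perfectly balanced, but it often does in practice), passing `stratify=y` preserves each class's proportion in both splits. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the 50/50/50 class balance intact in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

With stratification, the 45-sample test set contains exactly 15 examples of each species.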

Model Building and Training

Create and train a Random Forest classifier:

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

print("Model trained successfully!")
print("Predictions for first 10 test samples:", y_pred[:10])
Model trained successfully!
Predictions for first 10 test samples: [1 0 2 1 1 0 1 2 1 1]
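A single train/test split can be optimistic or pessimistic by chance. As a sanity check (not part of the original walkthrough), k-fold cross-validation averages accuracy over several different splits:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# cv=5 trains and evaluates the model on 5 different train/test folds
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The cross-validated mean is usually a touch below the perfect score on the single split shown next, which is the more honest estimate.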

Model Evaluation

Evaluate the model's performance using the accuracy score:

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)
Model Accuracy: 1.00

Feature Importance:
             feature  importance
2  petal length (cm)    0.458159
3   petal width (cm)    0.413859
0  sepal length (cm)    0.103138
1   sepal width (cm)    0.024844
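Accuracy alone hides which classes get confused with one another. As a complement, scikit-learn's `confusion_matrix` and `classification_report` break performance down per class:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes; off-diagonal
# entries count misclassifications
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

With the perfect accuracy seen on this split, the confusion matrix is purely diagonal; on harder datasets the off-diagonal cells point at the problematic classes.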

Key Steps Summary

Step              Purpose                  Scikit-learn Module
Data Loading      Import dataset           datasets
Data Splitting    Train/test separation    model_selection
Model Training    Algorithm training       ensemble, svm, etc.
Model Evaluation  Performance assessment   metrics
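Because every scikit-learn estimator exposes the same `fit`/`predict` interface, trying another algorithm from the table above is essentially a one-line change. A brief sketch comparing a support vector machine and k-nearest neighbors on the same split:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each estimator follows the identical fit -> predict pattern
results = {}
for clf in (SVC(), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    results[type(clf).__name__] = accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, results[type(clf).__name__])
```

Only the constructor call changes between algorithms; the training and evaluation code stays identical.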

Conclusion

Scikit-learn provides a streamlined workflow for building machine learning models, from data preprocessing to model evaluation. The library's consistent API makes it easy to experiment with different algorithms and achieve strong results on a wide range of datasets.

Updated on: 2026-03-25T06:14:07+05:30
