Learning Model Building in Scikit-learn: A Python Machine Learning Library
Scikit-learn is a powerful and user-friendly machine learning library for Python. It provides simple and efficient tools for data mining, data analysis, and building machine learning models with support for algorithms like random forest, support vector machines, and k-nearest neighbors.
Installing Required Libraries
Before building models, ensure you have the necessary libraries installed:
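If any of these are missing, they can be installed from PyPI; the package names below are the standard PyPI names for the libraries used in this article:

```shell
# Install scikit-learn and the supporting libraries used below
pip install scikit-learn pandas numpy seaborn matplotlib
```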
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
Loading and Exploring the Dataset
We'll use the famous Iris dataset to demonstrate model building:
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target
print("Dataset shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
Dataset shape: (150, 5)

First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5               1.4               0.2        0
1                4.9               3.0               1.4               0.2        0
2                4.7               3.2               1.3               0.2        0
3                4.6               3.1               1.5               0.2        0
4                5.0               3.6               1.4               0.2        0
Data Exploration
Let's examine the dataset structure and check for missing values:
# Basic information about the dataset
print("Dataset info:")
print(data.info())
print("\nMissing values:")
print(data.isnull().sum())
print("\nTarget classes:")
print(data['species'].value_counts())
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int32
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
None

Missing values:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64

Target classes:
species
0    50
1    50
2    50
Name: count, dtype: int64
Data Preprocessing
Prepare the data for model training by separating features and target variable:
# Separate features and target
X = data.drop('species', axis=1)
y = data['species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
Training set shape: (105, 4)
Testing set shape: (45, 4)
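Because the Iris classes are perfectly balanced (50 samples each), a plain random split works well here. On imbalanced data it is safer to preserve class proportions with the stratify argument; a minimal sketch:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# stratify=y keeps each class's share identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_train.value_counts())  # 35 samples per class
print(y_test.value_counts())   # 15 samples per class
```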
Model Building and Training
Create and train a Random Forest classifier:
# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
print("Model trained successfully!")
print("Predictions for first 10 test samples:", y_pred[:10])
Model trained successfully!
Predictions for first 10 test samples: [1 0 2 1 1 0 1 2 1 1]
Model Evaluation
Evaluate the model's performance using accuracy score:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
Model Accuracy: 1.00
Feature Importance:
feature importance
2 petal length (cm) 0.458159
3 petal width (cm) 0.413859
0 sepal length (cm) 0.103138
1 sepal width (cm) 0.024844
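An accuracy of 1.00 on a 45-sample test set can be optimistic. K-fold cross-validation from the same model_selection module gives a more robust estimate; a sketch using 5 folds:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train and evaluate on 5 different train/test partitions of the data
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The mean across folds is typically a little below the single-split result, which is a more realistic picture of generalization.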
Key Steps Summary
| Step | Purpose | Scikit-learn Module |
|---|---|---|
| Data Loading | Import dataset | datasets |
| Data Splitting | Train/test separation | model_selection |
| Model Training | Algorithm training | ensemble, svm, etc. |
| Model Evaluation | Performance assessment | metrics |
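Beyond a single accuracy number, the metrics module in the table also provides per-class precision, recall, and F1 scores. A sketch continuing the same workflow:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 3x3 matrix: one row per true class, one column per predicted class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```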
Conclusion
Scikit-learn provides a streamlined workflow for machine learning model building, from data preprocessing to model evaluation. Its consistent estimator API (fit, predict, score) makes it easy to swap in different algorithms and compare their performance on the same dataset with minimal code changes.
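To illustrate that consistency, swapping the random forest for a support vector machine changes only the estimator line; the fit and score calls stay the same. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Same fit/predict/score interface as RandomForestClassifier
model = SVC(kernel='rbf', random_state=42)
model.fit(X_train, y_train)
print("SVC test accuracy: %.2f" % model.score(X_test, y_test))
```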
