How to Standardize Data in a Pandas DataFrame?
Data standardization, also known as z-score normalization, is a common feature-scaling step that rescales each column to have a mean of 0 and a standard deviation of 1. This puts features measured in different units on a comparable scale, so that features with large numeric ranges do not dominate scale-sensitive machine learning algorithms. You can standardize DataFrame columns with scikit-learn's StandardScaler or directly with pandas operations.
What is Data Standardization?
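Standardization transforms each value using the formula z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the column. The result has a mean of 0 and a standard deviation of 1. The shape of the distribution is unchanged, so the standardized data is normally distributed only if the original data was.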
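To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy. The sample values and variable names are made up for illustration, and the sketch assumes the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # mean of the values
sigma = x.std()        # population standard deviation (ddof=0)
z = (x - mu) / sigma   # z-score of every value

print(mu, sigma)       # 5.0 2.0
print(z)               # z-scores: -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
```

After the transformation, z has a mean of 0 and a standard deviation of 1, while the relative spacing of the values is preserved.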
Method 1: Using StandardScaler from sklearn
The most common approach uses sklearn's StandardScaler class:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Create sample data
data = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})
print("Original Data:")
print(data)
# Initialize and apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Convert back to DataFrame
standardized_df = pd.DataFrame(scaled_data, columns=data.columns)
print("\nStandardized Data:")
print(standardized_df)
Original Data:
Age Salary
0 25 50000
1 30 60000
2 35 70000
3 40 80000
4 45 90000
Standardized Data:
Age Salary
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2 0.000000 0.000000
3 0.707107 0.707107
4 1.414214 1.414214
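In a machine learning workflow, the scaler is usually fitted on the training data only and then reused on new data. The following is a minimal sketch of that pattern, assuming scikit-learn is installed; the train and test frames are hypothetical and simply extend the sample data above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test sets with the same columns
train = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                      'Salary': [50000, 60000, 70000, 80000, 90000]})
test = pd.DataFrame({'Age': [50], 'Salary': [100000]})

scaler = StandardScaler()
scaler.fit(train)  # learn the mean and standard deviation from training data only

# Reuse the fitted parameters on both sets
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(test_scaled)

print(test_scaled)
print(restored)
```

Fitting on the training set alone keeps information from the test set out of the scaling parameters.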
Method 2: Using apply() with Custom Function
You can create a custom standardization function and apply it to columns:
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'Score1': [80, 85, 90, 95, 100],
    'Score2': [70, 75, 80, 85, 90]
})
def standardize_column(column):
    return (column - column.mean()) / column.std()
# Apply standardization to all numeric columns
standardized_df = data.apply(standardize_column)
print("Standardized using apply():")
print(standardized_df)
Standardized using apply():
Score1 Score2
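0 -1.264911 -1.264911
1 -0.632456 -0.632456
2 0.000000 0.000000
3 0.632456 0.632456
4 1.264911 1.264911
These values differ from the StandardScaler output because pandas' std() defaults to the sample standard deviation (ddof=1), whereas StandardScaler divides by the population standard deviation (ddof=0).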
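The effect of the ddof argument can be checked directly. The sketch below reuses the Score1 values and compares both settings; the variable names are illustrative. Passing ddof=0 to std() is one way to make pandas results match StandardScaler.

```python
import numpy as np
import pandas as pd

scores = pd.Series([80, 85, 90, 95, 100])

sample_std = scores.std()             # pandas default: ddof=1 (divides by n - 1)
population_std = scores.std(ddof=0)   # divides by n, like np.std and StandardScaler

z_sample = (scores - scores.mean()) / sample_std
z_population = (scores - scores.mean()) / population_std

print(sample_std, population_std)
print(z_sample.round(6).tolist())
print(z_population.round(6).tolist())
```

The difference shrinks as the number of rows grows, but on small samples like this one it is clearly visible.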
Method 3: Using subtract() and divide() Methods
Pandas provides direct methods for mathematical operations:
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'Height': [160, 165, 170, 175, 180],
    'Weight': [50, 55, 60, 65, 70]
})
# Standardize using subtract and divide methods
standardized_df = data.subtract(data.mean()).divide(data.std())
print("Using subtract() and divide():")
print(standardized_df)
Using subtract() and divide():
Height Weight
0 -1.264911 -1.264911
1 -0.632456 -0.632456
2 0.000000 0.000000
3 0.632456 0.632456
4 1.264911 1.264911
Method 4: Using sub() and div() Methods
sub() and div() are shorter aliases for subtract() and divide():
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'Math': [78, 82, 88, 92, 95],
    'Science': [85, 89, 91, 94, 98]
})
# Standardize using sub and div methods
standardized_df = data.sub(data.mean()).div(data.std())
print("Using sub() and div():")
print(standardized_df)
Using sub() and div():
Math Science
0 -1.285714 -1.298305
1 -0.714286 -0.486864
2 0.142857 -0.081144
3 0.714286 0.527437
4 1.142857 1.338877
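Real DataFrames often contain text columns or columns with a constant value, where the standard deviation is 0 and the division would produce NaN. The following is one possible sketch for handling both cases; the Bonus and Name columns are hypothetical, and replacing a zero standard deviation with 1 is an assumption made here for illustration, not the only reasonable choice.

```python
import pandas as pd

# Hypothetical data: one constant column and one text column
df = pd.DataFrame({
    'Math': [78, 82, 88, 92, 95],
    'Bonus': [5, 5, 5, 5, 5],            # zero variance
    'Name': ['A', 'B', 'C', 'D', 'E']    # non-numeric
})

numeric = df.select_dtypes(include='number')   # keep only numeric columns
std = numeric.std().replace(0, 1)              # avoid dividing by zero

result = df.copy()
result[numeric.columns] = numeric.sub(numeric.mean()).div(std)
print(result)
```

With this guard, the constant column becomes all zeros instead of NaN, and the text column is left untouched.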
Comparison of Methods
| Method | Best For | Advantages |
|---|---|---|
| StandardScaler | Machine learning pipelines | Fit on training data and reuse on test data; supports inverse_transform |
| apply() | Custom transformations | Flexible; works with any column-wise function |
| subtract()/divide() | Simple standardization | Vectorized pandas operations, no extra dependency |
| sub()/div() | Concise code | Aliases of subtract()/divide() with shorter names |
Conclusion
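Data standardization is essential for machine learning algorithms that are sensitive to feature scales. Use StandardScaler for production pipelines, where the same fitted parameters must be applied to new data, and the pandas methods for quick data exploration. Keep in mind that pandas' std() and StandardScaler use different defaults for the standard deviation (ddof=1 versus ddof=0), so their outputs differ slightly unless you pass ddof=0 to std().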
