How to Standardize Data in a Pandas DataFrame?

Data standardization, also called z-score normalization, is a common feature-scaling step that rescales each column to a mean of 0 and a standard deviation of 1. This keeps features measured on large scales from dominating scale-sensitive machine learning algorithms. Pandas and scikit-learn offer several ways to standardize DataFrame columns efficiently.

What is Data Standardization?

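Standardization transforms each value using the formula z = (x - μ) / σ, where μ is the column mean and σ is the column standard deviation. The result has a mean of 0 and a standard deviation of 1, which gives all features a consistent scale. It does not change the shape of the distribution, so the output follows a standard normal distribution only if the input was normally distributed.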
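
As a quick check of the formula, the sketch below standardizes a small Series by hand. The values and variable names are made up for this example, and ddof=0 is assumed so that σ is the population standard deviation.

```python
import pandas as pd

# Illustrative values (not from any real dataset)
x = pd.Series([2.0, 4.0, 6.0, 8.0])

mu = x.mean()          # column mean
sigma = x.std(ddof=0)  # population standard deviation

# z = (x - mu) / sigma, applied element-wise
z = (x - mu) / sigma

print(z.round(6).tolist())
# After standardizing, the mean is (numerically) 0 and the std is 1
print(round(z.mean(), 6), round(z.std(ddof=0), 6))
```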

Method 1: Using StandardScaler from sklearn

The most common approach uses the StandardScaler class from scikit-learn (sklearn), which divides by the population standard deviation (ddof=0):

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

print("Original Data:")
print(data)

# Initialize and apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Convert back to DataFrame
standardized_df = pd.DataFrame(scaled_data, columns=data.columns)
print("\nStandardized Data:")
print(standardized_df)

Output:

Original Data:
   Age  Salary
0   25   50000
1   30   60000
2   35   70000
3   40   80000
4   45   90000

Standardized Data:
        Age    Salary
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
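
When the data is split into training and test sets, the scaler is usually fitted on the training rows only and then reused on the test rows. The sketch below illustrates that pattern along with inverse_transform(); the split itself and the extra test values are hypothetical and only serve the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split of Age/Salary style data
train = pd.DataFrame({'Age': [25, 30, 35, 40],
                      'Salary': [50000, 60000, 70000, 80000]})
test = pd.DataFrame({'Age': [45, 50],
                     'Salary': [90000, 100000]})

scaler = StandardScaler()

# Learn the mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)

# Reuse the training statistics for the test rows (no refitting)
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

# inverse_transform maps standardized values back to the original units
restored = pd.DataFrame(scaler.inverse_transform(test_scaled),
                        columns=test.columns, index=test.index)

print(test_scaled)
print(restored)
```

Because the test rows are scaled with the training mean and standard deviation, their standardized values do not have to average to 0.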

Method 2: Using apply() with Custom Function

You can write a custom standardization function and apply it to each column:

import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Score1': [80, 85, 90, 95, 100],
    'Score2': [70, 75, 80, 85, 90]
})

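def standardize_column(column):
    # ddof=0 uses the population standard deviation, matching StandardScaler;
    # pandas' std() defaults to the sample standard deviation (ddof=1)
    return (column - column.mean()) / column.std(ddof=0)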

# Apply standardization to all numeric columns
standardized_df = data.apply(standardize_column)

print("Standardized using apply():")
print(standardized_df)

Output:

Standardized using apply():
      Score1    Score2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
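
apply() runs the function on every column, so a DataFrame that also holds text columns needs its numeric columns picked out first. Below is a minimal sketch of one way to do that with select_dtypes(). The Name column and the sample values are invented for illustration, and ddof=0 is passed so the result matches StandardScaler (omit it to use pandas' default sample standard deviation).

```python
import pandas as pd

# Hypothetical DataFrame mixing a text column with numeric columns
df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Score1': [80, 85, 90, 95],
    'Score2': [70, 75, 80, 85],
})

def standardize_column(column):
    # ddof=0 uses the population standard deviation
    return (column - column.mean()) / column.std(ddof=0)

# Pick out the numeric columns and standardize only those
numeric_cols = df.select_dtypes(include='number').columns
result = df.copy()
result[numeric_cols] = df[numeric_cols].apply(standardize_column)

print(result)
```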

Method 3: Using subtract() and divide() Methods

Pandas also provides arithmetic methods that operate on a whole DataFrame, matching each column with its own mean and standard deviation:

import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Height': [160, 165, 170, 175, 180],
    'Weight': [50, 55, 60, 65, 70]
})

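# Standardize using subtract and divide methods
# ddof=0 gives the population standard deviation (as StandardScaler uses);
# pandas' default is the sample standard deviation (ddof=1)
standardized_df = data.subtract(data.mean()).divide(data.std(ddof=0))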

print("Using subtract() and divide():")
print(standardized_df)

Output:

Using subtract() and divide():
      Height    Weight
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
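
One detail worth checking when mixing pandas and sklearn is the ddof setting: pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population standard deviation (ddof=0). The sketch below compares the two on the Height/Weight data; the variable names are chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Height': [160, 165, 170, 175, 180],
    'Weight': [50, 55, 60, 65, 70]
})

# pandas default: sample standard deviation (ddof=1)
z_sample = data.sub(data.mean()).div(data.std())

# population standard deviation (ddof=0), as StandardScaler uses
z_population = data.sub(data.mean()).div(data.std(ddof=0))

z_sklearn = pd.DataFrame(StandardScaler().fit_transform(data),
                         columns=data.columns)

print(z_sample['Height'].round(6).tolist())
# [-1.264911, -0.632456, 0.0, 0.632456, 1.264911]
print(z_population['Height'].round(6).tolist())
# [-1.414214, -0.707107, 0.0, 0.707107, 1.414214]
```

The difference shrinks as the number of rows grows, but on small samples like this one it is clearly visible.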

Method 4: Using sub() and div() Methods

sub() and div() are shorter aliases for subtract() and divide():

import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Math': [78, 82, 88, 92, 95],
    'Science': [85, 89, 91, 94, 98]
})

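# Standardize using sub and div methods
# ddof=0 gives the population standard deviation (as StandardScaler uses);
# pandas' default is the sample standard deviation (ddof=1)
standardized_df = data.sub(data.mean()).div(data.std(ddof=0))

print("Using sub() and div():")
print(standardized_df)

Output:

Using sub() and div():
       Math   Science
0 -1.437472 -1.451549
1 -0.798596 -0.544331
2  0.159719 -0.090722
3  0.798596  0.589692
4  1.277753  1.496910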
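
The sub()/div() one-liner does not cover a column with zero variance: its standard deviation is 0, so the division produces NaN. Below is a minimal sketch of one possible guard. The Constant column and the choice to map such columns to 0 are assumptions made for illustration, not part of the method described here.

```python
import pandas as pd

data = pd.DataFrame({
    'Math': [78, 82, 88, 92, 95],
    'Constant': [5, 5, 5, 5, 5],  # hypothetical zero-variance column
})

std = data.std(ddof=0)

# Replace a zero standard deviation with 1 so the division is well defined;
# a constant column then standardizes to all zeros instead of NaN
safe_std = std.replace(0, 1)

standardized_df = data.sub(data.mean()).div(safe_std)
print(standardized_df)
```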

Comparison of Methods

Method               | Best For                    | Advantages
StandardScaler       | Machine learning pipelines  | Handles train/test splits, inverse transform
apply()              | Custom transformations      | Flexible, readable code
subtract()/divide()  | Simple standardization      | Direct pandas operations
sub()/div()          | Concise code                | Shorter method names

Conclusion

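Data standardization is essential for machine learning algorithms that are sensitive to feature scales, such as k-nearest neighbors, support vector machines, and models trained with gradient descent. Use StandardScaler for production pipelines and the pandas methods for quick data exploration. Keep in mind that StandardScaler divides by the population standard deviation (ddof=0), while pandas' std() defaults to the sample standard deviation (ddof=1), so pass ddof=0 when the two need to agree.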

Updated on: 2026-03-27T13:57:42+05:30
