How to Standardize Data in a Pandas DataFrame?


In data analysis, standardization, a common form of feature scaling, is an important preprocessing step. It rescales features measured in different units or ranges onto a common scale, typically with a mean of zero and a standard deviation of one, so that they can be compared and analyzed fairly. Python's Pandas library makes this straightforward.

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure designed to simplify data manipulation. Its intuitive syntax has made it a popular choice among data practitioners. Let us look at the methods we can use to standardize the data in a DataFrame.

Algorithm

In this article, we focus on the following methods for standardizing data in a Pandas DataFrame:

a. Using sklearn.preprocessing.StandardScaler

b. Using the pandas.DataFrame.apply method with a z-score function

c. Using the pandas.DataFrame.subtract and pandas.DataFrame.divide methods

d. Using the pandas.DataFrame.sub and pandas.DataFrame.div methods (the shorter names for subtract and divide)

Syntax

Throughout this article, we rely on the pandas library, which provides many functions for manipulating DataFrames, and on scikit-learn for the first method. Here is a concise overview of the syntax for each method:

StandardScaler

scaler = StandardScaler()

`StandardScaler` is a class from the `sklearn.preprocessing` module used to standardize features by removing the mean and scaling to unit variance. First, create an instance of the `StandardScaler` class.

fit_transform()

scaler.fit_transform(X)

The `fit_transform()` method fits the scaler to the input data `X` (computing the mean and standard deviation of each feature) and then returns the standardized data.

apply

df.apply(func, axis=0)

`apply()` is a Pandas dataframe method used to apply a function along a specified axis (rows or columns). `func` is the function to apply, and `axis` is the axis along which the function is applied (0 for columns and 1 for rows).
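To make the `axis` argument concrete, here is a minimal sketch. The DataFrame and the lambda functions are made up for illustration and are not part of the later examples.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row.sum(), axis=1)

print(col_range)  # one value per column
print(row_sum)    # one value per row
```

With `axis=0` the result is indexed by column name, and with `axis=1` it is indexed by the original row labels.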

subtract and divide

df.subtract(df.mean()).divide(df.std())

This syntax standardizes a Pandas dataframe by subtracting the mean (`df.mean()`) and dividing by the standard deviation (`df.std()`) for each column. Note that `df.std()` computes the sample standard deviation (`ddof=1`) by default.
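As a rough illustration of how this expression works across several columns at once, the sketch below uses a small made-up DataFrame (the column names `height_cm` and `weight_kg` are assumptions for the example). Each column should end up with a mean of about 0 and a sample standard deviation of about 1.

```python
import pandas as pd

# Hypothetical two-column data on very different scales
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 80.0, 90.0],
})

# df.mean() and df.std() are Series indexed by column name,
# so pandas aligns them with the matching DataFrame columns
standardized = df.subtract(df.mean()).divide(df.std())

print(standardized.mean().round(6))  # approximately 0 for every column
print(standardized.std().round(6))   # approximately 1 for every column
```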

sub and div

df.sub(df.mean()).div(df.std())

`sub()` and `div()` are the shorter names for `subtract()` and `divide()`, so this expression behaves the same as the previous one: it performs element-wise subtraction of the column means followed by element-wise division by the column standard deviations.

As before, this subtracts the mean and divides by the standard deviation for each column in the DataFrame.
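A quick way to check that the two spellings give the same result is to compare their outputs directly. This is a minimal sketch on made-up data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [2, 4, 6, 8, 10]})

via_long_names = df.subtract(df.mean()).divide(df.std())
via_short_names = df.sub(df.mean()).div(df.std())

# DataFrame.equals compares shape, labels, and values
print(via_long_names.equals(via_short_names))
```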

Examples

Using sklearn.preprocessing.StandardScaler

In the following example we will:

1. Import necessary libraries: StandardScaler from sklearn, pandas, and numpy.

2. Create a sample DataFrame 'df' with a single column 'A' containing values 1 to 5.

3. Instantiate a StandardScaler object 'scaler' and use it to standardize column 'A' by applying the fit_transform() method.

4. Print the updated DataFrame with the standardized values in column 'A'.

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Initialize a scaler
scaler = StandardScaler()

# Fit and transform the data
df['A'] = scaler.fit_transform(np.array(df['A']).reshape(-1, 1))

print(df)

Output

          A
0 -1.414214
1 -0.707107
2  0.000000
3  0.707107
4  1.414214
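StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' std() defaults to the sample standard deviation (ddof=1). The sketch below, which uses only pandas and NumPy and is not part of the original example, shows how the two conventions produce different z-scores for the same values.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5], dtype="float64")

# Population standard deviation (ddof=0), the convention StandardScaler uses
population_z = (values - values.mean()) / values.std(ddof=0)

# Sample standard deviation (ddof=1), the pandas default
sample_z = (values - values.mean()) / values.std()

# NumPy's np.std also defaults to ddof=0
print(np.isclose(values.std(ddof=0), np.std(values.to_numpy())))

print(population_z.round(6).tolist())  # [-1.414214, -0.707107, 0.0, 0.707107, 1.414214]
print(sample_z.round(6).tolist())      # [-1.264911, -0.632456, 0.0, 0.632456, 1.264911]
```

If the pandas-based methods need to match StandardScaler exactly, passing ddof=0 to std() is one way to do it.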

Using the pandas.DataFrame.apply method with z-score

In the example below we are going to:

1. Import pandas library and create a sample DataFrame 'df' with a single column 'A' containing values 1 to 5.

2. Define a function 'standardize' that takes a column and returns the standardized values by subtracting the mean and dividing by the standard deviation.

3. Apply the 'standardize' function to column 'A' using the apply() method.

4. Print the updated DataFrame with the standardized values in column 'A'.

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

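def standardize(column):
    # 'column' is a Series; std() uses the sample standard deviation (ddof=1) by default
    return (column - column.mean()) / column.std()

# Standardize column 'A' using DataFrame.apply, which passes each column to the function as a Series
# (Series.apply would pass individual values instead of the whole column)
df[['A']] = df[['A']].apply(standardize)

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911

These values differ from the StandardScaler result because pandas' std() divides by n - 1 (sample standard deviation) by default, while StandardScaler divides by n (population standard deviation).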

Utilizing the pandas.DataFrame.subtract and pandas.DataFrame.divide methods

In the following example we will:

1. Import pandas library and create a sample DataFrame 'df' with a single column 'A' containing values 1 to 5.

2. Calculate the mean and standard deviation of column 'A' using mean() and std() methods.

3. Standardize column 'A' by subtracting the mean and dividing by the standard deviation using subtract() and divide() methods.

4. Print the updated DataFrame with the standardized values in column 'A'.

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Standardize column 'A' using subtract and divide methods
df['A'] = df['A'].subtract(df['A'].mean()).divide(df['A'].std())

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911

Utilizing the pandas.DataFrame.sub and pandas.DataFrame.div methods

In the example below we are going to:

1. Import pandas library and create a sample DataFrame 'df' with a single column 'A' containing values 1 to 5.

2. Calculate the mean and standard deviation of column 'A' using mean() and std() methods.

3. Standardize column 'A' by subtracting the mean and dividing by the standard deviation using sub() and div() methods.

4. Print the updated DataFrame with the standardized values in column 'A'.

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Standardize column 'A' using sub and div methods
df['A'] = df['A'].sub(df['A'].mean()).div(df['A'].std())

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911

Conclusion

In conclusion, standardization is an important preprocessing step for many machine learning algorithms, which are often sensitive to the scale of their input features. The right scaling method depends on the specific algorithm and the nature of the data. Z-score standardization is often a good fit when the data is approximately normally distributed, while Min-Max normalization is a common alternative when the distribution is unknown or clearly non-normal. In either case, it pays to understand the data before committing to a particular scaling method. Understanding the principles behind these methods and how to implement them in Python provides a solid foundation for further data analysis.
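For comparison, Min-Max normalization can be sketched with the same kind of pandas arithmetic. This is a minimal illustration on the sample column used throughout the article, not a method covered in the examples above; it rescales each column to the range 0 to 1.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# Min-Max normalization: (x - min) / (max - min), computed per column
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```

Note that this sketch assumes every column has at least two distinct values; a constant column would make the denominator zero.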

Updated on: 28-Aug-2023
