Building a Data Preprocessing Pipeline with Python and the Pandas Library

In the field of data analysis and machine learning, data preprocessing plays a vital role in preparing raw data for further analysis and model building. Data preprocessing involves a series of steps that clean, transform, and restructure data to make it suitable for analysis. Python, with its powerful libraries and tools, provides an excellent ecosystem for building robust data preprocessing pipelines. One such library is Pandas, a popular data manipulation and analysis library that offers a wide range of functions and methods for working with structured data.

In this tutorial, we will delve into the process of building a data preprocessing pipeline using Python and the Pandas library. We will cover various essential techniques and functionalities offered by Pandas that will enable us to handle missing data, perform data transformation, handle categorical variables, and normalize data.

Getting Started

Before we proceed with building the data preprocessing pipeline, we need to ensure that we have Pandas installed. Pandas can be installed using pip:

pip install pandas

Once Pandas is successfully installed, we can start building our data preprocessing pipeline. Let's begin by importing the necessary libraries and creating some sample data:

import pandas as pd
import numpy as np

# Create sample data for demonstration
data = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, None, 35, 22, 28],
    'salary': [50000, 60000, 75000, None, 55000],
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'rating': [4.5, 3.8, 4.2, 4.0, 3.9]
})

print("Original Data:")
print(data)
Original Data:
    name   age   salary department  rating
0  Alice  25.0  50000.0         HR     4.5
1    Bob   NaN  60000.0         IT     3.8
2   None  35.0  75000.0    Finance     4.2
3  David  22.0      NaN         IT     4.0
4    Eve  28.0  55000.0         HR     3.9

Building a Data Preprocessing Pipeline

Let's break down the preprocessing pipeline into essential steps and implement each one:

Step 1: Handling Missing Data

Missing data is a common occurrence in datasets and can significantly impact analysis accuracy. Pandas provides several methods to handle missing values:

import pandas as pd
import numpy as np

# Create sample data with missing values
data = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, None, 35, 22, 28],
    'salary': [50000, 60000, 75000, None, 55000],
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'rating': [4.5, 3.8, 4.2, 4.0, 3.9]
})

# Check for missing values
print("Missing values per column:")
print(data.isnull().sum())

# Fill missing numeric values with the column mean
# (assignment instead of inplace=True avoids chained-assignment warnings
# in pandas 2.x and keeps working under pandas 3.0 Copy-on-Write)
data['age'] = data['age'].fillna(data['age'].mean())
data['salary'] = data['salary'].fillna(data['salary'].mean())

# Fill missing names with a placeholder value
data['name'] = data['name'].fillna('Unknown')

print("\nAfter handling missing values:")
print(data)
Missing values per column:
name          1
age           1
salary        1
department    0
rating        0
dtype: int64

After handling missing values:
      name   age   salary department  rating
0    Alice  25.0  50000.0         HR     4.5
1      Bob  27.5  60000.0         IT     3.8
2  Unknown  35.0  75000.0    Finance     4.2
3    David  22.0  60000.0         IT     4.0
4      Eve  28.0  55000.0         HR     3.9
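
Mean imputation is only one strategy. Depending on the dataset, dropping incomplete rows or interpolating ordered numeric values may be more appropriate; here is a minimal sketch of both alternatives on the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, None, 35, 22, 28],
    'salary': [50000, 60000, 75000, None, 55000],
})

# Option 1: drop every row that contains at least one missing value
dropped = data.dropna()
print(len(dropped))  # 2 rows survive (Alice and Eve)

# Option 2: linear interpolation for ordered numeric data
data['age'] = data['age'].interpolate()
print(data['age'].tolist())  # Bob's missing age becomes 30.0, the midpoint of 25 and 35
```

Dropping rows is safest when missing values are rare; interpolation assumes the row order is meaningful (e.g. time series), which would not be the case for this toy employee table.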

Step 2: Data Transformation

Data transformation involves converting data into suitable formats for analysis. This includes filtering, sorting, and creating new features:

# Continue with the cleaned data
# Filter employees with rating above 4.0
high_performers = data[data['rating'] > 4.0]
print("High performers (rating > 4.0):")
print(high_performers)

# Sort by salary in descending order
sorted_data = data.sort_values('salary', ascending=False)
print("\nSorted by salary (descending):")
print(sorted_data)

# Create new feature: salary category
data['salary_category'] = pd.cut(data['salary'], 
                                bins=[0, 55000, 70000, float('inf')],
                                labels=['Low', 'Medium', 'High'])
print("\nData with salary categories:")
print(data[['name', 'salary', 'salary_category']])
High performers (rating > 4.0):
      name   age   salary department  rating
0    Alice  25.0  50000.0         HR     4.5
2  Unknown  35.0  75000.0    Finance     4.2

Sorted by salary (descending):
      name   age   salary department  rating
2  Unknown  35.0  75000.0    Finance     4.2
1      Bob  27.5  60000.0         IT     3.8
3    David  22.0  60000.0         IT     4.0
4      Eve  28.0  55000.0         HR     3.9
0    Alice  25.0  50000.0         HR     4.5

Data with salary categories:
      name   salary salary_category
0    Alice  50000.0             Low
1      Bob  60000.0          Medium
2  Unknown  75000.0            High
3    David  60000.0          Medium
4      Eve  55000.0          Medium
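
Aggregation is another transformation that often belongs in this step. As a small sketch, using the same departments and the salaries after imputation, groupby computes per-group statistics:

```python
import pandas as pd

data = pd.DataFrame({
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'salary': [50000.0, 60000.0, 75000.0, 60000.0, 55000.0],
})

# Mean salary per department
dept_avg = data.groupby('department')['salary'].mean()
print(dept_avg)  # Finance 75000.0, HR 52500.0, IT 60000.0
```

Per-group aggregates like these are also a common source of new features, e.g. each employee's salary relative to their department average.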

Step 3: Handling Categorical Variables

Categorical variables need to be converted to numerical format for machine learning algorithms. One-hot encoding is a common technique:

# One-hot encode the department column
encoded_data = pd.get_dummies(data, columns=['department'], prefix='dept')
print("Data with one-hot encoded departments:")
print(encoded_data.head())

# Label encoding for salary category
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['salary_category_encoded'] = le.fit_transform(data['salary_category'])
print("\nSalary categories with label encoding:")
print(data[['salary_category', 'salary_category_encoded']].drop_duplicates())
Data with one-hot encoded departments:
      name   age   salary  rating salary_category  dept_Finance  dept_HR  dept_IT
0    Alice  25.0  50000.0     4.5             Low         False     True    False
1      Bob  27.5  60000.0     3.8          Medium         False    False     True
2  Unknown  35.0  75000.0     4.2            High          True    False    False
3    David  22.0  60000.0     4.0          Medium         False    False     True
4      Eve  28.0  55000.0     3.9          Medium         False     True    False

Salary categories with label encoding:
  salary_category  salary_category_encoded
0             Low                        1
1          Medium                        2
2            High                        0
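
One caveat: LabelEncoder assigns codes alphabetically (High=0, Low=1, Medium=2 above), which throws away the natural Low < Medium < High order. Because pd.cut already returns an ordered categorical, its category codes preserve bin order; a sketch of that alternative:

```python
import pandas as pd

salary = pd.Series([50000.0, 60000.0, 75000.0, 60000.0, 55000.0])
category = pd.cut(salary, bins=[0, 55000, 70000, float('inf')],
                  labels=['Low', 'Medium', 'High'])

# pd.cut produces an ordered categorical, so .cat.codes follows bin order
print(category.cat.codes.tolist())  # [0, 1, 2, 1, 0] with Low=0, Medium=1, High=2

# Equivalent explicit mapping, useful when categories come from elsewhere
order = {'Low': 0, 'Medium': 1, 'High': 2}
print(category.map(order).tolist())  # [0, 1, 2, 1, 0]
```

For ordinal variables fed to linear models, an order-preserving encoding like this is usually preferable to alphabetical label codes.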

Step 4: Data Normalization

Normalization ensures all features are on a similar scale, which is important for many machine learning algorithms:

# Select numeric columns for normalization
numeric_columns = ['age', 'salary', 'rating']
data_numeric = data[numeric_columns].copy()

# Min-Max Scaling (0 to 1)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_minmax = pd.DataFrame(
    scaler.fit_transform(data_numeric),
    columns=[f'{col}_minmax' for col in numeric_columns]
)

# Z-score normalization (StandardScaler)
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
data_standard = pd.DataFrame(
    std_scaler.fit_transform(data_numeric),
    columns=[f'{col}_std' for col in numeric_columns]
)

# Combine original and normalized data
normalized_data = pd.concat([data_numeric, data_minmax, data_standard], axis=1)
print("Original vs Normalized Data:")
print(normalized_data.round(3))
Original vs Normalized Data:
    age   salary  rating  age_minmax  salary_minmax  rating_minmax  age_std  salary_std  rating_std
0  25.0  50000.0     4.5       0.231          0.000          1.000   -0.580      -1.195       1.692
1  27.5  60000.0     3.8       0.423          0.400          0.000    0.000       0.000      -1.128
2  35.0  75000.0     4.2       1.000          1.000          0.571    1.739       1.793       0.483
3  22.0  60000.0     4.0       0.000          0.400          0.286   -1.275       0.000      -0.322
4  28.0  55000.0     3.9       0.462          0.200          0.143    0.116      -0.598      -0.725
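
The same two scalings can be written directly in pandas, which makes the formulas explicit. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version (ddof=1), so ddof=0 is passed below to match sklearn:

```python
import pandas as pd

ages = pd.Series([25.0, 27.5, 35.0, 22.0, 28.0])

# Min-max scaling: (x - min) / (max - min)
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score: (x - mean) / population std; ddof=0 matches StandardScaler
zscore = (ages - ages.mean()) / ages.std(ddof=0)

print(minmax.round(3).tolist())  # [0.231, 0.423, 1.0, 0.0, 0.462]
print(zscore.round(3).tolist())  # [-0.58, 0.0, 1.739, -1.275, 0.116]
```

These match the age columns in the table above, confirming what the sklearn scalers compute under the hood.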

Complete Pipeline Function

Here's a reusable function that combines all preprocessing steps:

def preprocess_data(df, target_column=None):
    """
    Complete data preprocessing pipeline
    """
    # Create a copy to avoid modifying original data
    processed_df = df.copy()
    
    # 1. Handle missing values
    numeric_cols = processed_df.select_dtypes(include=[np.number]).columns
    categorical_cols = processed_df.select_dtypes(include=['object']).columns
    
    # Fill numeric missing values with the column median
    # (assignment instead of inplace=True avoids chained-assignment warnings)
    for col in numeric_cols:
        processed_df[col] = processed_df[col].fillna(processed_df[col].median())
    
    # Fill categorical missing values with the column mode
    for col in categorical_cols:
        processed_df[col] = processed_df[col].fillna(processed_df[col].mode()[0])
    
    # 2. Handle categorical variables (one-hot encoding)
    processed_df = pd.get_dummies(processed_df, columns=categorical_cols, drop_first=True)
    
    # 3. Normalize numeric features (excluding target if specified)
    features_to_scale = numeric_cols.tolist()
    if target_column and target_column in features_to_scale:
        features_to_scale.remove(target_column)
    
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    processed_df[features_to_scale] = scaler.fit_transform(processed_df[features_to_scale])
    
    return processed_df, scaler

# Test the pipeline
raw_data = pd.DataFrame({
    'age': [25, None, 35, 22, 28],
    'salary': [50000, 60000, 75000, None, 55000],
    'department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'rating': [4.5, 3.8, 4.2, 4.0, 3.9]
})

processed_data, fitted_scaler = preprocess_data(raw_data, target_column='rating')
print("Processed data:")
print(processed_data.round(3))
Processed data:
     age  salary  rating  department_HR  department_IT
0 -0.531  -1.127     4.5           True          False
1 -0.185   0.059     3.8          False           True
2  1.778   1.840     4.2          False          False
3 -1.224  -0.237     4.0          False           True
4  0.162  -0.534     3.9           True          False
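
The reason preprocess_data returns the fitted scaler: when new data arrives later (a validation set, or rows at inference time), it should be transformed with the statistics learned from the training data rather than re-fitted. A minimal sketch of that pattern with StandardScaler alone:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
train = pd.DataFrame({'age': [25.0, 27.5, 35.0, 22.0, 28.0]})
scaler = StandardScaler()
scaler.fit(train)

# New rows are transformed with the training mean/std, never re-fitted
new_rows = pd.DataFrame({'age': [30.0]})
scaled = scaler.transform(new_rows)
print(scaled)  # about 0.58: (30 - 27.5) / population std of the training ages
```

Re-fitting on new data would silently shift the scale and leak information, so persisting the fitted scaler alongside the model is standard practice.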

Conclusion

Building an effective data preprocessing pipeline with Python and Pandas involves systematic handling of missing data, data transformation, categorical variable encoding, and normalization. This pipeline ensures your data is clean, consistent, and ready for machine learning algorithms, ultimately leading to more accurate and reliable model performance.

Updated on: 2026-03-27T14:15:15+05:30
