Article Categories

Selected Reading

How to Clean String Data in a Given Pandas DataFrame?

Python Pandas Server Side Programming Programming

String data in Pandas DataFrames often requires cleaning before analysis. This includes removing whitespace, handling special characters, standardizing case, and dealing with missing values. Pandas provides powerful string methods through the .str accessor to handle these tasks efficiently.

Creating Sample Data

Let's start with a DataFrame containing messy string data ?

import pandas as pd

# Create sample data with common string issues
data = {
    'Name': [' John Doe ', 'JANE SMITH', ' mary johnson ', '  Bob Wilson  '],
    'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org'],
    'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
    'City': ['New York', 'Los Angeles', None, 'Chicago,IL']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
           Name            Email          Phone        City
0     John Doe    john@EXAMPLE.com   123-456-7890    New York
1   JANE SMITH    jane@example.COM   555.123.4567  Los Angeles
2  mary johnson   mary@Example.com  (999) 888-7777         None
3    Bob Wilson       bob@test.org                Chicago,IL

Removing Leading and Trailing Spaces

The str.strip() method removes whitespace from both ends of strings ?

import pandas as pd

data = {
    'Name': [' John Doe ', 'JANE SMITH', ' mary johnson ', '  Bob Wilson  '],
    'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org']
}

df = pd.DataFrame(data)

# Remove leading and trailing spaces
df['Name'] = df['Name'].str.strip()
print("After removing spaces:")
print(df['Name'])

After removing spaces:
0        John Doe
1      JANE SMITH
2    mary johnson
3      Bob Wilson
Name: Name, dtype: object

Converting to Lowercase

Use str.lower() to standardize text case ?

import pandas as pd

data = {
    'Name': ['John Doe', 'JANE SMITH', 'mary johnson', 'Bob Wilson'],
    'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org']
}

df = pd.DataFrame(data)

# Convert to lowercase
df['Name'] = df['Name'].str.lower()
df['Email'] = df['Email'].str.lower()

print("After converting to lowercase:")
print(df)

After converting to lowercase:
           Name             Email
0      john doe    john@example.com
1    jane smith    jane@example.com
2  mary johnson    mary@example.com
3    bob wilson       bob@test.org

Replacing Special Characters

The str.replace() method removes or replaces unwanted characters ?

import pandas as pd

data = {
    'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
    'City': ['New York', 'Los Angeles', 'Miami', 'Chicago,IL']
}

df = pd.DataFrame(data)

# Remove special characters from phone numbers
df['Phone_Clean'] = df['Phone'].str.replace(r'[-().\s]', '', regex=True)

# Remove comma from city names
df['City_Clean'] = df['City'].str.replace(',', ' ')

print("After cleaning special characters:")
print(df[['Phone', 'Phone_Clean', 'City', 'City_Clean']])

After cleaning special characters:
          Phone Phone_Clean        City   City_Clean
0  123-456-7890  1234567890    New York     New York
1  555.123.4567  5551234567  Los Angeles  Los Angeles
2  (999) 888-7777  9998887777       Miami        Miami
3                                Chicago,IL   Chicago IL

Handling Missing Values

Use fillna() to replace missing values and isna() to detect them ?

import pandas as pd

data = {
    'Name': ['John Doe', 'Jane Smith', 'Mary Johnson', 'Bob Wilson'],
    'City': ['New York', 'Los Angeles', None, ''],
    'Phone': ['123-456-7890', '', None, '999-888-7777']
}

df = pd.DataFrame(data)

# Check for missing values
print("Missing values before cleaning:")
print(df.isna().sum())

# Fill missing values
df['City'] = df['City'].fillna('Unknown')
df['Phone'] = df['Phone'].fillna('No Phone')

# Replace empty strings with meaningful values
df['City'] = df['City'].replace('', 'Unknown')
df['Phone'] = df['Phone'].replace('', 'No Phone')

print("\nAfter handling missing values:")
print(df)

Missing values before cleaning:
Name     0
City     1
Phone    1
dtype: int64

After handling missing values:
        Name         City         Phone
0   John Doe     New York  123-456-7890
1  Jane Smith  Los Angeles      No Phone
2  Mary Johnson      Unknown      No Phone
3  Bob Wilson      Unknown  999-888-7777

Splitting Strings into Columns

Use str.split() with expand=True to create new columns ?

import pandas as pd

data = {
    'Full_Name': ['John Doe', 'Jane Smith', 'Mary Johnson', 'Bob Wilson Jr'],
    'Email': ['john@gmail.com', 'jane@yahoo.com', 'mary@outlook.com', 'bob@company.org']
}

df = pd.DataFrame(data)

# Split full name into first and last name
df[['First_Name', 'Last_Name']] = df['Full_Name'].str.split(' ', n=1, expand=True)

# Extract email domain
df['Email_Domain'] = df['Email'].str.split('@').str[1]

print("After splitting strings:")
print(df[['Full_Name', 'First_Name', 'Last_Name', 'Email', 'Email_Domain']])

After splitting strings:
     Full_Name First_Name  Last_Name           Email Email_Domain
0     John Doe       John        Doe   john@gmail.com    gmail.com
1   Jane Smith       Jane      Smith   jane@yahoo.com    yahoo.com
2 Mary Johnson       Mary    Johnson mary@outlook.com  outlook.com
3 Bob Wilson Jr        Bob  Wilson Jr   bob@company.org  company.org

Complete Data Cleaning Example

Here's a comprehensive example combining all cleaning techniques ?

import pandas as pd

# Create messy data
data = {
    'Name': [' John DOE ', 'JANE smith', ' mary JOHNSON ', '  Bob Wilson  '],
    'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org'],
    'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
    'Address': ['New York', 'Los Angeles', None, 'Chicago,IL']
}

df = pd.DataFrame(data)
print("Original messy data:")
print(df)
print()

# Step 1: Handle missing values
df['Address'] = df['Address'].fillna('Unknown')
df['Phone'] = df['Phone'].replace('', 'No Phone')

# Step 2: Clean string data
df['Name'] = df['Name'].str.strip().str.title()  # Remove spaces and title case
df['Email'] = df['Email'].str.lower()  # Lowercase emails
df['Phone'] = df['Phone'].str.replace(r'[-().\s]', '', regex=True)  # Remove phone formatting
df['Address'] = df['Address'].str.replace(',', ', ')  # Fix comma spacing

# Step 3: Remove duplicates if any
df = df.drop_duplicates()

print("Cleaned data:")
print(df)

Original messy data:
           Name             Email          Phone      Address
0     John DOE    john@EXAMPLE.com   123-456-7890     New York
1   JANE smith    jane@example.COM   555.123.4567  Los Angeles
2  mary JOHNSON   mary@Example.com  (999) 888-7777         None
3    Bob Wilson       bob@test.org                Chicago,IL

Cleaned data:
           Name             Email        Phone        Address
0      John Doe    john@example.com   1234567890       New York
1    Jane Smith    jane@example.com   5551234567    Los Angeles
2  Mary Johnson    mary@example.com   9998887777        Unknown
3    Bob Wilson       bob@test.org     No Phone   Chicago,  IL

Common String Cleaning Methods

Method	Purpose	Example
`str.strip()`	Remove leading/trailing spaces	`df['col'].str.strip()`
`str.lower()`	Convert to lowercase	`df['col'].str.lower()`
`str.replace()`	Replace characters	`df['col'].str.replace('-', '')`
`str.split()`	Split into columns	`df['col'].str.split(' ', expand=True)`
`fillna()`	Fill missing values	`df['col'].fillna('Unknown')`

Conclusion

Cleaning string data is essential for reliable data analysis. Use str methods for text operations, fillna() for missing values, and combine techniques for comprehensive cleaning. Always validate your cleaned data to ensure it meets your analysis requirements.

Mukul Latiyan

Updated on: 2026-03-27T11:23:03+05:30

2K+ Views

Previous Next