Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to Clean String Data in a Given Pandas DataFrame?
String data in Pandas DataFrames often requires cleaning before analysis. This includes removing whitespace, handling special characters, standardizing case, and dealing with missing values. Pandas provides powerful string methods through the .str accessor to handle these tasks efficiently.
Creating Sample Data
Let's start with a DataFrame containing messy string data ?
import pandas as pd
# Create sample data with common string issues
data = {
'Name': [' John Doe ', 'JANE SMITH', ' mary johnson ', ' Bob Wilson '],
'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org'],
'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
'City': ['New York', 'Los Angeles', None, 'Chicago,IL']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Original DataFrame:
Name Email Phone City
0 John Doe john@EXAMPLE.com 123-456-7890 New York
1 JANE SMITH jane@example.COM 555.123.4567 Los Angeles
2 mary johnson mary@Example.com (999) 888-7777 None
3 Bob Wilson bob@test.org Chicago,IL
Removing Leading and Trailing Spaces
The str.strip() method removes whitespace from both ends of strings ?
import pandas as pd
data = {
'Name': [' John Doe ', 'JANE SMITH', ' mary johnson ', ' Bob Wilson '],
'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org']
}
df = pd.DataFrame(data)
# Remove leading and trailing spaces
df['Name'] = df['Name'].str.strip()
print("After removing spaces:")
print(df['Name'])
After removing spaces: 0 John Doe 1 JANE SMITH 2 mary johnson 3 Bob Wilson Name: Name, dtype: object
Converting to Lowercase
Use str.lower() to standardize text case ?
import pandas as pd
data = {
'Name': ['John Doe', 'JANE SMITH', 'mary johnson', 'Bob Wilson'],
'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org']
}
df = pd.DataFrame(data)
# Convert to lowercase
df['Name'] = df['Name'].str.lower()
df['Email'] = df['Email'].str.lower()
print("After converting to lowercase:")
print(df)
After converting to lowercase:
Name Email
0 john doe john@example.com
1 jane smith jane@example.com
2 mary johnson mary@example.com
3 bob wilson bob@test.org
Replacing Special Characters
The str.replace() method removes or replaces unwanted characters ?
import pandas as pd
data = {
'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
'City': ['New York', 'Los Angeles', 'Miami', 'Chicago,IL']
}
df = pd.DataFrame(data)
# Remove special characters from phone numbers
df['Phone_Clean'] = df['Phone'].str.replace(r'[-().\s]', '', regex=True)
# Remove comma from city names
df['City_Clean'] = df['City'].str.replace(',', ' ')
print("After cleaning special characters:")
print(df[['Phone', 'Phone_Clean', 'City', 'City_Clean']])
After cleaning special characters:
Phone Phone_Clean City City_Clean
0 123-456-7890 1234567890 New York New York
1 555.123.4567 5551234567 Los Angeles Los Angeles
2 (999) 888-7777 9998887777 Miami Miami
3 Chicago,IL Chicago IL
Handling Missing Values
Use fillna() to replace missing values and isna() to detect them ?
import pandas as pd
data = {
'Name': ['John Doe', 'Jane Smith', 'Mary Johnson', 'Bob Wilson'],
'City': ['New York', 'Los Angeles', None, ''],
'Phone': ['123-456-7890', '', None, '999-888-7777']
}
df = pd.DataFrame(data)
# Check for missing values
print("Missing values before cleaning:")
print(df.isna().sum())
# Fill missing values
df['City'] = df['City'].fillna('Unknown')
df['Phone'] = df['Phone'].fillna('No Phone')
# Replace empty strings with meaningful values
df['City'] = df['City'].replace('', 'Unknown')
df['Phone'] = df['Phone'].replace('', 'No Phone')
print("\nAfter handling missing values:")
print(df)
Missing values before cleaning:
Name 0
City 1
Phone 1
dtype: int64
After handling missing values:
Name City Phone
0 John Doe New York 123-456-7890
1 Jane Smith Los Angeles No Phone
2 Mary Johnson Unknown No Phone
3 Bob Wilson Unknown 999-888-7777
Splitting Strings into Columns
Use str.split() with expand=True to create new columns ?
import pandas as pd
data = {
'Full_Name': ['John Doe', 'Jane Smith', 'Mary Johnson', 'Bob Wilson Jr'],
'Email': ['john@gmail.com', 'jane@yahoo.com', 'mary@outlook.com', 'bob@company.org']
}
df = pd.DataFrame(data)
# Split full name into first and last name
df[['First_Name', 'Last_Name']] = df['Full_Name'].str.split(' ', n=1, expand=True)
# Extract email domain
df['Email_Domain'] = df['Email'].str.split('@').str[1]
print("After splitting strings:")
print(df[['Full_Name', 'First_Name', 'Last_Name', 'Email', 'Email_Domain']])
After splitting strings:
Full_Name First_Name Last_Name Email Email_Domain
0 John Doe John Doe john@gmail.com gmail.com
1 Jane Smith Jane Smith jane@yahoo.com yahoo.com
2 Mary Johnson Mary Johnson mary@outlook.com outlook.com
3 Bob Wilson Jr Bob Wilson Jr bob@company.org company.org
Complete Data Cleaning Example
Here's a comprehensive example combining all cleaning techniques ?
import pandas as pd
# Create messy data
data = {
'Name': [' John DOE ', 'JANE smith', ' mary JOHNSON ', ' Bob Wilson '],
'Email': ['john@EXAMPLE.com', 'jane@example.COM', 'mary@Example.com', 'bob@test.org'],
'Phone': ['123-456-7890', '555.123.4567', '(999) 888-7777', ''],
'Address': ['New York', 'Los Angeles', None, 'Chicago,IL']
}
df = pd.DataFrame(data)
print("Original messy data:")
print(df)
print()
# Step 1: Handle missing values
df['Address'] = df['Address'].fillna('Unknown')
df['Phone'] = df['Phone'].replace('', 'No Phone')
# Step 2: Clean string data
df['Name'] = df['Name'].str.strip().str.title() # Remove spaces and title case
df['Email'] = df['Email'].str.lower() # Lowercase emails
df['Phone'] = df['Phone'].str.replace(r'[-().\s]', '', regex=True) # Remove phone formatting
df['Address'] = df['Address'].str.replace(',', ', ') # Fix comma spacing
# Step 3: Remove duplicates if any
df = df.drop_duplicates()
print("Cleaned data:")
print(df)
Original messy data:
Name Email Phone Address
0 John DOE john@EXAMPLE.com 123-456-7890 New York
1 JANE smith jane@example.COM 555.123.4567 Los Angeles
2 mary JOHNSON mary@Example.com (999) 888-7777 None
3 Bob Wilson bob@test.org Chicago,IL
Cleaned data:
Name Email Phone Address
0 John Doe john@example.com 1234567890 New York
1 Jane Smith jane@example.com 5551234567 Los Angeles
2 Mary Johnson mary@example.com 9998887777 Unknown
3 Bob Wilson bob@test.org No Phone Chicago, IL
Common String Cleaning Methods
| Method | Purpose | Example |
|---|---|---|
str.strip() |
Remove leading/trailing spaces | df['col'].str.strip() |
str.lower() |
Convert to lowercase | df['col'].str.lower() |
str.replace() |
Replace characters | df['col'].str.replace('-', '') |
str.split() |
Split into columns | df['col'].str.split(' ', expand=True) |
fillna() |
Fill missing values | df['col'].fillna('Unknown') |
Conclusion
Cleaning string data is essential for reliable data analysis. Use str methods for text operations, fillna() for missing values, and combine techniques for comprehensive cleaning. Always validate your cleaned data to ensure it meets your analysis requirements.
