Write a Python program to separate a series of alphabets and digits and convert them to a dataframe
When working with mixed alphanumeric data in Pandas, you often need to separate alphabetic and numeric parts into different columns. This is commonly done using the str.extract() method with regular expressions.
Problem Statement
Given a Pandas Series containing strings with both letters and digits, we need to separate them into two columns in a DataFrame:
Original Series:
0    abx123
1     bcd25
2     cxy30
dtype: object
Expected DataFrame:
     0    1
0  abx  123
1  bcd   25
2  cxy   30
Solution Using str.extract()
The str.extract() method uses regular expressions with capturing groups to extract parts of strings. Each group in parentheses becomes a separate column:
import pandas as pd
# Create a series with mixed alphanumeric data
series = pd.Series(['abx123', 'bcd25', 'cxy30'])
print("Original series:")
print(series)
# Extract alphabets and digits using regex
df = series.str.extract(r'([a-z]+)(\d+)')
print("\nDataFrame after extraction:")
print(df)
Original series:
0    abx123
1     bcd25
2     cxy30
dtype: object

DataFrame after extraction:
     0    1
0  abx  123
1  bcd   25
2  cxy   30
Understanding the Regular Expression
The pattern r'([a-z]+)(\d+)' consists of two capturing groups:
([a-z]+) - Captures one or more lowercase letters
(\d+) - Captures one or more digits
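Because [a-z] only covers lowercase letters, it helps to see what the pattern does with input it does not match. The following is a minimal sketch, assuming a hypothetical sample series (not from the example above) that contains one uppercase value and one value without digits. It compares the original pattern with the same pattern passed together with flags=re.IGNORECASE:

```python
import re
import pandas as pd

# Hypothetical sample: one uppercase value and one value with no digits
samples = pd.Series(['abx123', 'BCD25', 'cxy'])

# The original pattern: lowercase letters followed by digits
strict = samples.str.extract(r'([a-z]+)(\d+)')

# Same pattern, but re.IGNORECASE lets [a-z] match uppercase letters too
relaxed = samples.str.extract(r'([a-z]+)(\d+)', flags=re.IGNORECASE)

print("Strict pattern:")
print(strict)
print("\nCase-insensitive pattern:")
print(relaxed)
```

With the strict pattern, the rows for 'BCD25' and 'cxy' come back as NaN in both columns, since neither contains lowercase letters followed by digits. With the case-insensitive flag, only 'cxy' remains unmatched.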
Adding Column Names
You can assign meaningful column names to make the DataFrame more readable:
import pandas as pd
series = pd.Series(['abx123', 'bcd25', 'cxy30'])
# Extract both groups (expand=True is the default and always returns a DataFrame),
# then assign column names on the next line
df = series.str.extract(r'([a-z]+)(\d+)', expand=True)
df.columns = ['Letters', 'Numbers']
print("DataFrame with column names:")
print(df)
DataFrame with column names:
  Letters Numbers
0     abx     123
1     bcd      25
2     cxy      30
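Note that str.extract() returns the captured text as strings, so the Numbers column holds '123', not 123. If you need to do arithmetic on it, one option is to convert it explicitly. A minimal sketch, assuming pd.to_numeric() is acceptable for your data (astype(int) would also work when every row matched):

```python
import pandas as pd

series = pd.Series(['abx123', 'bcd25', 'cxy30'])

df = series.str.extract(r'([a-z]+)(\d+)')
df.columns = ['Letters', 'Numbers']

# The extracted digits are still strings; convert them to a numeric dtype
df['Numbers'] = pd.to_numeric(df['Numbers'])

print(df.dtypes)
print("Sum of Numbers:", df['Numbers'].sum())  # 123 + 25 + 30 = 178
```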
Alternative Approach
You can also name the groups in the regex with the (?P<name>...) syntax. str.extract() then uses the group names as the column names:
import pandas as pd
series = pd.Series(['abx123', 'bcd25', 'cxy30'])
# Using named groups
df = series.str.extract(r'(?P<text>[a-z]+)(?P<number>\d+)')
print("DataFrame with named groups:")
print(df)
DataFrame with named groups:
  text number
0  abx    123
1  bcd     25
2  cxy     30
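Named groups stay readable as the pattern grows. As a rough illustration, the sketch below assumes a hypothetical series in which the letters and digits may be separated by a hyphen, underscore, or space. The separator is matched outside the groups, so it does not appear in the result:

```python
import pandas as pd

# Hypothetical data: an optional separator sits between letters and digits
series = pd.Series(['abx-123', 'bcd_25', 'cxy30'])

# [-_ ]? matches at most one separator and is not captured
pattern = r'(?P<text>[a-z]+)[-_ ]?(?P<number>\d+)'
df = series.str.extract(pattern)

print(df)
```

The resulting DataFrame has the columns text and number, with the separator characters dropped.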
Conclusion
Use str.extract() with regex capturing groups to separate mixed alphanumeric data into DataFrame columns. Named groups provide more descriptive column names automatically.
