How to Convert Unstructured Data to Structured Data Using Python

Unstructured data does not follow a predefined data model or format, and it can take many forms, such as text, images, audio, and video. Converting it to structured data is an important step in data analysis, because structured data is easier to query and extract insights from. Python provides several libraries and tools for this conversion.

In this article, we explore how to convert unstructured data into a structured format using Python, so that the data can be analysed and interpreted more meaningfully. We discuss the following two approaches:

  • Regular Expressions (Regex): This approach uses regular expressions to extract structured data from unstructured text. Each pattern describes a recurring piece of the text, such as a labelled field, and captures the relevant value.

  • Data Wrangling Libraries: Data wrangling libraries such as pandas can be used to clean and transform unstructured data into a structured format. These libraries provide functions to perform operations such as data cleaning, normalisation, and transformation.
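As a rough illustration of the cleaning and normalisation steps mentioned above, here is a minimal sketch with pandas. The sample records and the column names are invented for this illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical messy records, invented for illustration only
raw = pd.DataFrame({
    "name": ["  john doe", "JANE SMITH ", "jane smith"],
    "department": ["sales", " Marketing", "marketing"],
})

# Cleaning: strip stray whitespace. Normalisation: use one consistent casing.
cleaned = raw.apply(lambda col: col.str.strip().str.title())

# Transformation: drop rows that became exact duplicates after cleaning
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

After cleaning, the two spellings of Jane Smith collapse into a single row, which is the kind of consistency that later analysis depends on.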

Using Regular Expressions

Regular expressions are powerful tools for extracting structured information from unstructured text. Consider the following example, where we extract employee information from unstructured text:

Example

import re
import pandas as pd

# Sample unstructured text data
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# Define regular expression patterns to extract data
id_pattern = re.compile(r'Employee ID: (\d+)')
name_pattern = re.compile(r'Name: (.+)')
dept_pattern = re.compile(r'Department: (.+)')
time_pattern = re.compile(r'Punch Time: (.+)')

# Create empty lists to store extracted data
ids = []
names = []
depts = []
times = []

# Pair each pattern with the list that collects its values
fields = [(id_pattern, ids), (name_pattern, names),
          (dept_pattern, depts), (time_pattern, times)]

# Iterate through each line of the text data
for line in text_data.splitlines():
    # Store the captured value for the first pattern that matches the line
    for pattern, values in fields:
        match = pattern.match(line)
        if match:
            values.append(match.group(1))
            break

# Create a DataFrame using the extracted data
data = {'Employee ID': ids, 'Name': names, 'Department': depts, 'Punch Time': times}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

Output

  Employee ID        Name Department Punch Time
0        1234    John Doe      Sales   8:30 AM
1        2345  Jane Smith  Marketing   9:00 AM

How It Works

  • First, we define the unstructured text data as a multiline string.

  • Next, we define regular expression patterns to extract the relevant data from the text using the re module.

  • We create empty lists to store the extracted data.

  • We iterate through each line of the text data and check if it matches any of the regular expression patterns. If it does, we extract the relevant data and append it to the corresponding list.

  • Finally, we create a Pandas DataFrame using the extracted data and display it.
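The line-by-line approach above fills four separate lists, so a record with a missing field would shift later values into the wrong rows. One possible alternative, sketched here as an assumption and not as the method used in the example, is to match a whole record at once with named groups. The pattern name record_pattern and the group names are chosen for this illustration, and the sketch assumes every record lists the four fields in the same order.

```python
import re
import pandas as pd

# Same sample text as in the example above
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# One pattern per record; the named groups become the column names.
# Assumes the four fields always appear in this order (illustrative only).
record_pattern = re.compile(
    r"Employee ID: (?P<employee_id>\d+)\n"
    r"Name: (?P<name>.+)\n"
    r"Department: (?P<department>.+)\n"
    r"Punch Time: (?P<punch_time>.+)"
)

# Each match yields one complete record as a dictionary
records = [m.groupdict() for m in record_pattern.finditer(text_data)]
df = pd.DataFrame(records)
print(df)
```

With this layout an incomplete record simply fails to match and is skipped, so values are never paired with the wrong employee.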

Using Pandas Library for Data Transformation

Pandas provides powerful tools for reshaping and transforming data. Let's see how to convert employee punch data from a long format to a wide structured format:

Example

import pandas as pd

# Create sample punch data in long format (one row per punch event)
data = {
    'employee_id': [1001, 1001, 1002, 1002, 1001, 1001, 1002, 1002],
    'date': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', 
             '2022-01-02', '2022-01-02', '2022-01-02', '2022-01-02'],
    'time': ['09:01:22', '12:35:10', '08:58:30', '17:03:45',
             '09:12:43', '12:37:22', '08:55:10', '17:00:15'],
    'type': ['Punch-In', 'Punch-Out', 'Punch-In', 'Punch-Out',
             'Punch-In', 'Punch-Out', 'Punch-In', 'Punch-Out']
}

unstructured_data = pd.DataFrame(data)
print("Original unstructured data:")
print(unstructured_data)

# Pivot the table to get 'Punch-In' and 'Punch-Out' time for each employee on each date
structured_data = unstructured_data.pivot(index=['employee_id', 'date'], columns='type', values='time').reset_index()

# Rename column names
structured_data = structured_data.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Convert time strings to datetime objects for calculation
structured_data['punch_in_dt'] = pd.to_datetime(structured_data['punch_in'], format='%H:%M:%S')
structured_data['punch_out_dt'] = pd.to_datetime(structured_data['punch_out'], format='%H:%M:%S')

# Calculate total hours worked
structured_data['hours_worked'] = (structured_data['punch_out_dt'] - structured_data['punch_in_dt']).dt.total_seconds() / 3600

# Drop the temporary datetime columns
structured_data = structured_data.drop(['punch_in_dt', 'punch_out_dt'], axis=1)

print("\nStructured data:")
print(structured_data)

Output

Original unstructured data:
   employee_id        date      time       type
0         1001  2022-01-01  09:01:22   Punch-In
1         1001  2022-01-01  12:35:10  Punch-Out
2         1002  2022-01-01  08:58:30   Punch-In
3         1002  2022-01-01  17:03:45  Punch-Out
4         1001  2022-01-02  09:12:43   Punch-In
5         1001  2022-01-02  12:37:22  Punch-Out
6         1002  2022-01-02  08:55:10   Punch-In
7         1002  2022-01-02  17:00:15  Punch-Out

Structured data:
type  employee_id        date  punch_in punch_out  hours_worked
0            1001  2022-01-01  09:01:22  12:35:10      3.563333
1            1001  2022-01-02  09:12:43  12:37:22      3.410833
2            1002  2022-01-01  08:58:30  17:03:45      8.087500
3            1002  2022-01-02  08:55:10  17:00:15      8.084722
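The example above parses the times into temporary datetime columns and then drops them. A shorter variant, shown here as a sketch and not as a replacement for the original code, treats each HH:MM:SS string as a duration since midnight using pd.to_timedelta. The DataFrame name punches is chosen for this illustration, and its two rows are copied from the punch data above.

```python
import pandas as pd

# Two rows copied from the punch data above (wide format)
punches = pd.DataFrame({
    "employee_id": [1001, 1002],
    "date": ["2022-01-01", "2022-01-01"],
    "punch_in": ["09:01:22", "08:58:30"],
    "punch_out": ["12:35:10", "17:03:45"],
})

# Parse the HH:MM:SS strings as durations since midnight and subtract them
worked = pd.to_timedelta(punches["punch_out"]) - pd.to_timedelta(punches["punch_in"])

# Convert the resulting timedeltas to fractional hours
punches["hours_worked"] = worked.dt.total_seconds() / 3600

print(punches)
```

This avoids the helper columns entirely, although it assumes that both punches fall on the same calendar day, as they do in the sample data.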

Key Benefits of Structured Data

  • Easier Analysis: Structured data can be easily queried, filtered, and analysed using standard tools.

  • Better Visualization: Charts and graphs can be created more easily from structured formats.

  • Machine Learning Ready: Most ML algorithms require structured input data.

  • Database Storage: Structured data fits naturally into relational databases.
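To make the first and last points concrete, the sketch below filters a small DataFrame with pandas and then stores it in an in-memory SQLite database. The table name employees and the numeric IDs are assumptions made for this illustration. DataFrame.to_sql accepts a standard sqlite3 connection.

```python
import sqlite3
import pandas as pd

# Small structured table modelled on the employee example above
df = pd.DataFrame({
    "employee_id": [1234, 2345],
    "name": ["John Doe", "Jane Smith"],
    "department": ["Sales", "Marketing"],
})

# Easier analysis: filter rows with a boolean condition
sales = df[df["department"] == "Sales"]

# Database storage: write the table to an in-memory SQLite database.
# The table name "employees" is hypothetical.
conn = sqlite3.connect(":memory:")
df.to_sql("employees", conn, index=False)
rows = conn.execute(
    "SELECT name FROM employees WHERE department = 'Marketing'"
).fetchall()
conn.close()

print(sales)
print(rows)
```

Neither step is practical on the raw text block; both become one-liners once the data has columns and types.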

Comparison of Methods

Method              | Best For                      | Complexity | Performance
Regular Expressions | Text pattern extraction       | Medium     | Fast
Pandas Operations   | Data reshaping/transformation | Low        | Very Fast

Conclusion

Converting unstructured data to structured data is essential for effective data analysis. Python offers powerful tools like regular expressions for pattern extraction and pandas for data transformation. Choose the method based on your data type and complexity requirements.

Updated on: 2026-03-27T11:04:31+05:30
