How to Convert Unstructured Data to Structured Data Using Python?


Unstructured data is data that does not follow any specific data model or format, and it can come in different forms such as text, images, audio, and video. Converting unstructured data to structured data is an important task in data analysis, as structured data is easier to analyse and extract insights from. Python provides various libraries and tools for converting unstructured data to structured data, making it more manageable and easier to analyse.

In this article, we will explore how to convert unstructured attendance data, such as the employee punch records a biometric attendance system might produce, into a structured format using Python, allowing for more meaningful analysis and interpretation of the data.

There are several approaches to converting unstructured data into structured data in Python. In this article, we will discuss the following two:

  • Regular Expressions (Regex): This approach involves using regular expressions to extract structured data from unstructured text. Regex patterns can be defined to match specific patterns in the unstructured text and extract the relevant information.

  • Data Wrangling Libraries: Data wrangling libraries such as pandas can be used to clean and transform unstructured data into a structured format. These libraries provide functions to perform operations such as data cleaning, normalisation, and transformation.

Using Regular Expressions

Consider the code shown below.

Example

import re
import pandas as pd

# sample unstructured text data
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# define regular expression patterns to extract data
id_pattern = re.compile(r'Employee ID: (\d+)')
name_pattern = re.compile(r'Name: (.+)')
dept_pattern = re.compile(r'Department: (.+)')
time_pattern = re.compile(r'Punch Time: (.+)')

# create empty lists to store extracted data
ids = []
names = []
depts = []
times = []

# iterate through each line of the text data
for line in text_data.split('\n'):
    # check if the line matches any of the regular expression patterns
    if id_pattern.match(line):
        ids.append(id_pattern.match(line).group(1))
    elif name_pattern.match(line):
        names.append(name_pattern.match(line).group(1))
    elif dept_pattern.match(line):
        depts.append(dept_pattern.match(line).group(1))
    elif time_pattern.match(line):
        times.append(time_pattern.match(line).group(1))

# create a dataframe using the extracted data
data = {'Employee ID': ids, 'Name': names, 'Department': depts, 'Punch Time': times}
df = pd.DataFrame(data)

# print the dataframe
print(df)

Explanation

  • First, we define the unstructured text data as a multiline string.

  • Next, we define regular expression patterns to extract the relevant data from the text. We use the re module in Python for this.

  • We create empty lists to store the extracted data.

  • We iterate through each line of the text data and check if it matches any of the regular expression patterns. If it does, we extract the relevant data and append it to the corresponding list.

  • Finally, we create a Pandas dataframe using the extracted data and print it.

Output

  Employee ID        Name Department Punch Time
0        1234    John Doe      Sales    8:30 AM
1        2345  Jane Smith  Marketing    9:00 AM
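The line-by-line approach above stores each field in its own list, so a record with a missing line would leave the lists misaligned. As a minimal sketch of one alternative, assuming every record lists its four fields in the same order, a single pattern with named groups can capture one whole record at a time. The names record_pattern and records are illustrative and not part of the original example, and the type conversions (integer ID, time object) are an optional extra step.

```python
import re
from datetime import datetime

text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# One pattern per record; named groups keep each employee's fields together.
# Assumption: the four fields always appear in this order, one per line.
record_pattern = re.compile(
    r"Employee ID: (?P<employee_id>\d+)\n"
    r"Name: (?P<name>.+)\n"
    r"Department: (?P<department>.+)\n"
    r"Punch Time: (?P<punch_time>.+)"
)

records = []
for match in record_pattern.finditer(text_data):
    row = match.groupdict()
    # Convert the captured strings into typed values
    row["employee_id"] = int(row["employee_id"])
    row["punch_time"] = datetime.strptime(row["punch_time"], "%I:%M %p").time()
    records.append(row)

for row in records:
    print(row)
```

Because each record is a dictionary, passing the list to pd.DataFrame(records) would build a table with one column per named group.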

Using the Pandas Library

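Suppose we have raw punch records stored in a file named unstructured_data.csv, with one row per punch event. The rows are not yet organised per employee and per day, and they look like this.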

employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
1001,2022-01-02,09:12:43,Punch-In
1001,2022-01-02,12:37:22,Punch-Out
1002,2022-01-02,08:55:10,Punch-In
1002,2022-01-02,17:00:15,Punch-Out

Example

import pandas as pd

# Load the raw punch records (columns: employee_id, date, time, type)
unstructured_data = pd.read_csv("unstructured_data.csv")

# Pivot the table to get the 'Punch-In' and 'Punch-Out' time for each employee on each date
structured_data = unstructured_data.pivot(index=['employee_id', 'date'], columns='type', values='time').reset_index()

# Rename the pivoted columns
structured_data = structured_data.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Calculate the time worked by subtracting 'punch_in' from 'punch_out';
# the HH:MM:SS strings are converted to timedeltas first
structured_data['hours_worked'] = pd.to_timedelta(structured_data['punch_out']) - pd.to_timedelta(structured_data['punch_in'])

# Print the structured data
print(structured_data)

Output

type  employee_id        date  punch_in punch_out    hours_worked
0            1001  2022-01-01  09:01:22  12:35:10 0 days 03:33:48
1            1001  2022-01-02  09:12:43  12:37:22 0 days 03:24:39
2            1002  2022-01-01  08:58:30  17:03:45 0 days 08:05:15
3            1002  2022-01-02  08:55:10  17:00:15 0 days 08:05:05
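Real punch records are often incomplete, for example when an employee forgets to punch out. The following is a minimal sketch of how such rows could be detected after pivoting. The sample data is hypothetical (employee 1002 has no Punch-Out row), it is read from a string rather than a file so the snippet is self-contained, and the names csv_text, wide, and incomplete are illustrative.

```python
import io
import pandas as pd

# Hypothetical sample: employee 1002 has no Punch-Out row
csv_text = """employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
"""

records = pd.read_csv(io.StringIO(csv_text))

# One row per employee and date, one column per punch type
wide = records.pivot(index=["employee_id", "date"], columns="type", values="time")

# A missing punch appears as NaN after the pivot,
# so any row that contains NaN is an incomplete day
incomplete = wide[wide.isna().any(axis=1)]
print(incomplete)
```

Note that pivot raises a ValueError when the same employee, date, and punch type appear more than once, so duplicate rows would need to be resolved (for instance with drop_duplicates) before reshaping.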

Conclusion

In conclusion, unstructured data can be difficult to analyse and interpret. However, Python makes it possible to convert such data into a structured form: regular expressions can extract individual fields from free text, and pandas can clean, reshape, and summarise the resulting records.

Updated on: 03-Aug-2023
