How to extract required data from structured strings in Python?

Introduction...

I will show you a couple of methods for extracting the required data or fields from structured strings. These approaches work when the input follows a known, fixed format.

How to do it..

1. Let us create one dummy format to understand the approach.

Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>

Report: Daily_Report - Time: 2020-10-16T01:01:01.000001 - Player: Federer - Titles: 20 - Country: Switzerland

report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

2. The first thing to notice in the report is the separator, " - " (a hyphen with a space on each side). We will split the report on that separator.

fields = report.split(' - ')
name, time, player , titles, _ = fields

print(f"Output \n *** The report name {name} generated on {time} has {titles} titles for {player}. ")

Output

*** The report name Report: Daily_Report generated on Time: 2020-10-10T12:30:59.000000 has Titles: 20 titles for Player: Federer.
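
The choice of separator matters here because the timestamp itself contains hyphens. The following is a minimal sketch that compares splitting on a bare "-" with splitting on " - "; the variable name naive_fields is only illustrative.

```python
report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')

# Splitting on the bare hyphen also cuts the date 2020-10-10 into pieces.
naive_fields = report.split('-')

# Splitting on the padded separator keeps the timestamp in one piece.
fields = report.split(' - ')

print(len(naive_fields), len(fields))
print(fields[1])
```

With the padded separator the report yields exactly five fields, which is what the tuple unpacking in this step relies on.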

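3. The output is not quite what we want: each field still carries its label (Report:, Time:, Player:, Titles:), which is not required. We can strip the labels by splitting each field at the label's colon and keeping the part after it.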

# extract only the report name
formatted_name = name.split(': ')[1]

# extract only the player
formatted_player = player.split(': ')[1]

# extract only the titles, converted to an integer
formatted_titles = int(titles.split(': ')[1])

# extract only the timestamp; split on ': ' because the time itself contains colons
new_time = time.split(': ')[1]

print(f"Output \n {formatted_name} , {new_time}, {formatted_player} , {formatted_titles}")

Output

Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
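
If you would rather not handle each field by hand, the same label stripping can be done in a loop. This is only a sketch under the assumption that every field looks like "Label: value"; the helper name to_dict is hypothetical and not part of the steps above.

```python
report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')

def to_dict(line):
    """Map each label to its raw string value (hypothetical helper)."""
    record = {}
    for field in line.split(' - '):
        # partition cuts at the first ': ' only, so the colons inside
        # the timestamp stay in the value.
        label, _, value = field.partition(': ')
        record[label] = value
    return record

record = to_dict(report)
print(record['Player'], record['Time'])
```

All values come back as strings, so a numeric field such as Titles still needs an explicit int() conversion.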

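4. The timestamp is in ISO 8601 format. You can leave it as a string, or convert it into a datetime object so that its parts (date, time, and so on) can be accessed individually. Here is how to convert it with datetime.fromisoformat (available from Python 3.7).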

from datetime import datetime
formatted_date = datetime.fromisoformat(new_time)

print(f"Output \n{formatted_date}")

Output

2020-10-10 12:30:59
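
Once the string is a datetime object, its parts are available as attributes and methods. A short sketch, using illustrative variable names (stamp, report_date, report_time, label):

```python
from datetime import datetime

stamp = datetime.fromisoformat('2020-10-10T12:30:59.000000')

# Separate the calendar date from the time of day.
report_date = stamp.date()
report_time = stamp.time()

# Individual components are plain integers.
print(stamp.year, stamp.month, stamp.day)

# strftime produces any other textual layout you need.
label = stamp.strftime('%d/%m/%Y %H:%M')
print(report_date, report_time, label)
```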

5. Now we will combine all these steps into a single function.

def parse_function(log):
    """
    Function : Parse the given log in the format
               Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>
    Args     : log
    Return   : required data
    """
    fields = log.split(' - ')
    name, time, player, titles, _ = fields

    # extract only the report name
    formatted_name = name.split(': ')[1]

    # extract only the player
    formatted_player = player.split(': ')[1]

    # extract only the titles, converted to an integer
    formatted_titles = int(titles.split(': ')[1])

    # extract only the timestamp; split on ': ' because the time itself contains colons
    new_time = time.split(': ')[1]

    return f"{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}"

if __name__ == '__main__':
    report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
    data = parse_function(report)
    print(f"Output \n{data}")

Output

Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
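
A function like this raises an exception when a line does not have the expected shape, for example when a field is missing. The sketch below shows one possible defensive variant that returns None instead. The name safe_parse and the choice to return a tuple are assumptions made for illustration, and the labels themselves are not validated.

```python
def safe_parse(log):
    """Hypothetical variant of parse_function that returns None on bad input."""
    fields = log.split(' - ')
    if len(fields) != 5:
        return None  # wrong number of fields

    values = []
    for field in fields:
        label, sep, value = field.partition(': ')
        if not sep:
            return None  # field has no "Label: value" shape
        values.append(value)

    name, time, player, titles, _country = values
    if not titles.isdigit():
        return None  # titles must be a whole number
    return name, time, player, int(titles)

report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')
print(safe_parse(report))
print(safe_parse('not a report line'))
```

Returning None lets the caller skip malformed lines when looping over a whole log file.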

6. We can use the parse module to make this simpler. The report format is effectively a template, so we can describe it once and let parse extract the fields for us.

First, install the parse module: pip install parse

from parse import parse
report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

# Looking at the report, create a template
template = 'Report: {name} - Time: {time} - Player: {player} - Titles: {titles} - Country: {country}'

# Run parse and check the results
data = parse(template, report)
print(f"Output \n{data}")

Output

<Result () {'name': 'Daily_Report', 'time': '2020-10-10T12:30:59.000000', 'player': 'Federer', 'titles': '20', 'country': 'Switzerland'}>

7. With a single call we can extract the data from the log by defining the template. Now let us access the individual values.

print(f"Output \n {data['name']} - {data['time']} - {data['player']} - {data['titles']} - {data['country']}")

Output

Daily_Report - 2020-10-10T12:30:59.000000 - Federer - 20 - Switzerland

Conclusion :

You have seen a couple of methods for parsing the required data from a log line. When the format is known, prefer defining a template and using the parse module to extract the required data.

raja
Published on 10-Nov-2020 06:06:53