How to extract required data from structured strings in Python?
Introduction
I will show you a couple of methods to extract the required fields from structured strings. These approaches work when the input follows a known, fixed format.
How to do it...
1. Let us create a dummy format to understand the approach.
Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>
Report: Daily_Report - Time: 2020-10-16T01:01:01.000001 - Player: Federer - Titles: 20 - Country: Switzerland
report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
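2. The first thing to notice in the report is the separator, " - " (a hyphen with a space on each side). We will split the report on that separator.
# split the report into its five "Label: value" fields
fields = report.split(' - ')
# the country field is not needed here, so it is discarded into _
name, time, player, titles, _ = fields
print(f"Output \n *** The report name {name} generated on {time} has {titles} titles for {player}. ")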
Output
*** The report name Report: Daily_Report generated on Time: 2020-10-10T12:30:59.000000 has Titles: 20 titles for Player: Federer.
3. The output is not quite what we want: it still contains labels such as Report:, Time: and Player:, which are not required. Splitting each field once more on ': ' removes them.
# extract only report name
formatted_name = name.split(': ', 1)[1]
# extract only player
formatted_player = player.split(': ', 1)[1]
# extract only titles, converted to an integer
formatted_titles = int(titles.split(': ', 1)[1])
# extract only time; maxsplit=1 leaves the colons inside the timestamp alone
new_time = time.split(': ', 1)[1]
print(f"Output \n {formatted_name} , {new_time}, {formatted_player} , {formatted_titles}")
Output
Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
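If you do not want to name each field by hand, the same two splits can be applied in a loop. The following is only a sketch under the assumption that every field looks like "Label: value"; the helper name fields_to_dict is made up for this illustration and is not part of the original steps.

```python
def fields_to_dict(report):
    """Hypothetical helper: turn a ' - ' separated report into a dict."""
    result = {}
    for field in report.split(' - '):
        # partition splits on the first ': ' only, so the colons inside
        # the timestamp value are left untouched
        label, _, value = field.partition(': ')
        result[label] = value
    return result


report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')
record = fields_to_dict(report)
print(record['Player'], record['Titles'])
```

All values stay strings here, so convert them (for example with int) where you need a number.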
4. The timestamp is in ISO 8601 format. You can leave it as a string, or convert it to a datetime object so that its individual parts become accessible. Let me show you how to convert it (datetime.fromisoformat requires Python 3.7 or later).
from datetime import datetime
formatted_date = datetime.fromisoformat(new_time)
print(f"Output \n{formatted_date}")
Output
2020-10-10 12:30:59
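Once the timestamp is a datetime object, its parts can be read directly. A small sketch, reusing the new_time value from the step above; the variable names parsed, report_date and report_time are just for illustration.

```python
from datetime import datetime

new_time = '2020-10-10T12:30:59.000000'
parsed = datetime.fromisoformat(new_time)

# split the timestamp into a date part and a time part
report_date = parsed.date()
report_time = parsed.time()

# individual components are plain integers
print(parsed.year, parsed.month, parsed.day)
print(report_date.isoformat(), report_time.isoformat())
```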
5. Now we will combine all these steps into a single function.
def parse_function(log):
    """
    Function : Parse the given log in the format
               Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>
    Args     : log
    Return   : required data
    """
    fields = log.split(' - ')
    name, time, player, titles, _ = fields

    # extract only report name
    formatted_name = name.split(': ', 1)[1]
    # extract only player
    formatted_player = player.split(': ', 1)[1]
    # extract only titles, converted to an integer
    formatted_titles = int(titles.split(': ', 1)[1])
    # extract only time
    new_time = time.split(': ', 1)[1]

    return f"{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}"


if __name__ == '__main__':
    report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
    data = parse_function(report)
    print(f"Output \n{data}")
Output
Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
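The split-based approach depends on every line having exactly five fields; a shorter or longer line makes the tuple unpacking fail with a fairly cryptic error. One possible way to guard against that is sketched below. The name parse_report_checked and the choice to raise ValueError are my own assumptions for this illustration, not part of the function above.

```python
EXPECTED_FIELDS = 5  # Report, Time, Player, Titles, Country


def parse_report_checked(log):
    """Hypothetical variant of the parser that validates the field count."""
    fields = log.split(' - ')
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, found {len(fields)}: {log!r}"
        )
    name, time, player, titles, _ = fields
    # return a tuple of the cleaned values instead of a formatted string
    return (
        name.split(': ', 1)[1],
        time.split(': ', 1)[1],
        player.split(': ', 1)[1],
        int(titles.split(': ', 1)[1]),
    )


good = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
        'Player: Federer - Titles: 20 - Country: Switzerland')
bad = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000'

print(parse_report_checked(good))
try:
    parse_report_checked(bad)
except ValueError as err:
    print(f"Skipped malformed line: {err}")
```

Returning a tuple rather than a formatted string keeps the number of titles as an integer, which is easier to work with afterwards.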
6. We can use the parse module to make this simpler. The format of the report is effectively a template, and parse can match a string against such a template directly.
First install the parse module: pip install parse
from parse import parse
report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
# Looking at the report, create a template
template = 'Report: {name} - Time: {time} - Player: {player} - Titles: {titles} - Country: {country}'
# Run parse and check the results
data = parse(template, report)
print(f"Output \n{data}")
Output
<Result () {'name': 'Daily_Report', 'time': '2020-10-10T12:30:59.000000', 'player': 'Federer', 'titles': '20', 'country': 'Switzerland'}>
7. With a simple one-liner we are able to extract the data from the log by defining a template. Note that every value, including titles, comes back as a string. Now let us extract the individual values.
print(f"Output \n {data['name']} - {data['time']} - {data['player']} - {data['titles']} - {data['country']}")
Output
Daily_Report - 2020-10-10T12:30:59.000000 - Federer - 20 - Switzerland
Conclusion
You have seen a couple of methods to parse the required data from a structured log line. Prefer defining a template and using the parse module to extract the data required.