How to extract required data from structured strings in Python?
Introduction...
I will show you a couple of methods for extracting the required data or fields from structured strings. These approaches work when the input follows a known, fixed format.
How to do it...
1. Let us create one dummy format to understand the approach.
Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>
Report: Daily_Report - Time: 2020-10-16T01:01:01.000001 - Player: Federer - Titles: 20 - Country: Switzerland
report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
2. The first thing to notice in the report is the separator, " - " (a hyphen surrounded by spaces). We will split the report on that separator.
fields = report.split(' - ')
name, time, player, titles, _ = fields
print(f"Output \n *** The report name {name} generated on {time} has {titles} titles for {player}. ")
Output
*** The report name Report: Daily_Report generated on Time: 2020-10-10T12:30:59.000000 has Titles: 20 titles for Player: Federer.
3. The output is not yet what we want: labels such as Report:, Time: and Player: are still attached to the values. We can strip them by splitting each field on ": " and keeping the second part.
# extract only the report name
formatted_name = name.split(': ')[1]

# extract only the player
formatted_player = player.split(': ')[1]

# extract only the titles, converted to an integer
formatted_titles = int(titles.split(': ')[1])

# extract only the timestamp
new_time = time.split(': ')[1]

print(f"Output \n{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}")
Output
Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
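The label stripping in step 3 repeats the same operation for every field, so it can be generalized. The following is a minimal sketch, assuming every field has the form "Label: value" and that neither " - " nor ": " occurs inside a value. The helper name fields_to_dict is hypothetical and is not part of the original steps.

```python
def fields_to_dict(report):
    """Split a 'Label: value - Label: value' string into a dict (illustrative helper)."""
    result = {}
    for field in report.split(' - '):
        # partition splits on the first ': ' only, so the colons inside
        # the timestamp value are left untouched
        label, _, value = field.partition(': ')
        result[label] = value
    return result


report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')
data = fields_to_dict(report)
print(data['Report'], data['Time'], data['Player'], int(data['Titles']))
```

Looking fields up by label avoids depending on their position, at the cost of assuming the labels themselves are stable.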
4. The timestamp is in ISO 8601 format. You can leave it as a string, or convert it to a datetime object so that its date and time components become accessible. Here is how to convert it (datetime.fromisoformat requires Python 3.7 or later).
from datetime import datetime

formatted_date = datetime.fromisoformat(new_time)
print(f"Output \n{formatted_date}")
Output
2020-10-10 12:30:59
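Once the timestamp is a datetime object, its parts can be read individually. This is a short sketch using only the standard library; the variable name parsed and the chosen output format string are illustrative assumptions.

```python
from datetime import datetime

new_time = '2020-10-10T12:30:59.000000'
parsed = datetime.fromisoformat(new_time)  # available in Python 3.7+

# individual components of the timestamp
print(parsed.date())                        # date part only
print(parsed.time())                        # time part only
print(parsed.year, parsed.month, parsed.day)

# re-format the timestamp with an explicit, numeric-only pattern
print(parsed.strftime('%d/%m/%Y %H:%M'))
```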
5. Now we will combine all these steps into a single function.
def parse_function(log):
    """
    Function : Parse the given log in the format
               Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>
    Args     : log
    Return   : required data
    """
    fields = log.split(' - ')
    name, time, player, titles, _ = fields

    # extract only the report name
    formatted_name = name.split(': ')[1]

    # extract only the player
    formatted_player = player.split(': ')[1]

    # extract only the titles, converted to an integer
    formatted_titles = int(titles.split(': ')[1])

    # extract only the timestamp
    new_time = time.split(': ')[1]

    return f"{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}"


if __name__ == '__main__':
    report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
    data = parse_function(report)
    print(f"Output \n{data}")
Output
Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
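The split-based function assumes the input always has exactly five well-formed fields; anything else fails with an unpacking or index error that says little about the cause. One possible way to make that assumption explicit is sketched below. The function name parse_report_checked and the error messages are hypothetical, and the expected label list is taken from the dummy format in step 1.

```python
def parse_report_checked(log):
    """Return the five field values, rejecting malformed input (hypothetical helper)."""
    expected_labels = ['Report', 'Time', 'Player', 'Titles', 'Country']
    fields = log.split(' - ')
    if len(fields) != len(expected_labels):
        raise ValueError(f"expected {len(expected_labels)} fields, got {len(fields)}")

    values = []
    for expected, field in zip(expected_labels, fields):
        # sep is empty when the field contains no ': ' at all
        label, sep, value = field.partition(': ')
        if not sep or label != expected:
            raise ValueError(f"expected a '{expected}: <value>' field, got {field!r}")
        values.append(value)
    return values


good = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
        'Player: Federer - Titles: 20 - Country: Switzerland')
print(parse_report_checked(good))

try:
    parse_report_checked('Report: Daily_Report - Player: Federer')
except ValueError as err:
    print(f"Rejected: {err}")
```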
6. The report format is effectively a template, so the third-party parse module lets us extract the fields with much less code.
First install the parse module with: pip install parse
from parse import parse

report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

# Looking at the report, create a template
template = 'Report: {name} - Time: {time} - Player: {player} - Titles: {titles} - Country: {country}'

# Run parse and check the results
data = parse(template, report)
print(f"Output \n{data}")
Output
<Result () {'name': 'Daily_Report', 'time': '2020-10-10T12:30:59.000000', 'player': 'Federer', 'titles': '20', 'country': 'Switzerland'}>
7. With a single call to parse() and a template, we extracted all the data from the log. Now let us access the individual values by name.
print(f"Output \n {data['name']} - {data['time']} - {data['player']} - {data['titles']} - {data['country']}")
Output
Daily_Report - 2020-10-10T12:30:59.000000 - Federer - 20 - Switzerland
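If installing a third-party package is not an option, a similar named-field extraction can be approximated with the standard library's re module. This is a rough counterpart written by hand for this one format, not a description of how the parse module works internally; the pattern and group names below are assumptions chosen to mirror the template fields.

```python
import re

# Hand-written regex mirroring the template; group names match the template fields
pattern = re.compile(
    r'Report: (?P<name>.+?) - Time: (?P<time>.+?) - Player: (?P<player>.+?)'
    r' - Titles: (?P<titles>\d+) - Country: (?P<country>.+)'
)

report = ('Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - '
          'Player: Federer - Titles: 20 - Country: Switzerland')

match = pattern.fullmatch(report)
if match is None:
    print("Report does not match the expected format")
else:
    data = match.groupdict()
    print(data['name'], data['player'], int(data['titles']), data['country'])
```

The regex is harder to read than the template, which is the main argument for the template approach when the extra dependency is acceptable.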
Conclusion
You have seen a couple of methods for extracting the required data from a structured string such as a log line. When the format is known in advance, prefer defining a template and using the parse module to extract the data.