How to extract dates from text using Python regular expressions?


Before starting, we need a few regular expression fundamentals. A regular expression is a pattern, and any string that adheres to the pattern is said to match it. There are many ways to write a pattern, which can make regular expressions look complex, but the notation used here is simple. If regular expressions are new to you, it is worth reading an introduction to them before continuing.

Extracting dates from text is a common task when learning to code. You will find this article helpful if you are automating a Python script that pulls specific values out of a CSV file, if you are a data scientist who needs to separate dates from surrounding text, or if you are a Python enthusiast who wants to learn more about working with strings.

From here on, familiarity with regular expression fundamentals is assumed.

Example 1

Only basic notation is needed to create a regex pattern for dates. We aim to match dates written as day, month, and year, separated by either '/' or '-', where the day and the month have two digits each and the year has four digits. Now let's build the pattern piece by piece.

\d matches a single digit. To match exactly two digits, we put the number 2 in curly braces after it, so \d{2} matches exactly two consecutive digits. The patterns for the day, the month, and the year are therefore \d{2}, \d{2}, and \d{4}, respectively. These three must be joined with a '/' or a '-', which the character class [/-] matches.

The final regex pattern is \d{2}[/-]\d{2}[/-]\d{4}.

Now that the hard part is finished, the remaining task is easy.
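
As a quick check of the pattern before applying it to a file, the following minimal sketch runs it against a short string. The sample text is made up for illustration and is not part of the original example.

```python
import re

# The pattern built above: two digits, a separator, two digits,
# a separator, and four digits.
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Hypothetical sample text, used only for illustration.
sample = "Invoice dated 15/03/2021, due 14-04-2021, ref 2021/03/15."

# findall returns every non-overlapping match as a list of strings.
matches = re.findall(pattern, sample)
print(matches)  # ['15/03/2021', '14-04-2021']
```

The year-first string 2021/03/15 is not matched, because the pattern requires four digits after the second separator. Note that the pattern accepts either separator in each position, so a mixed form such as 15/03-2021 would also match.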


Input 1

import re

# Open the file that you want to search and read its
# entire content as a single string
with open("doc.txt", "r") as f:
    content = f.read()

# The regex pattern that we created (a raw string keeps the backslashes intact)
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Returns a list of all the strings that match the pattern
dates = re.findall(pattern, content)

Note that this regex pattern also extracts strings that are not legitimate dates, such as 40/32/2019. To filter those out, the code can be modified as follows:

Input 2

import re

# Open the file that you want to search and read its
# entire content as a single string
with open("doc.txt", "r") as f:
    content = f.read()

# The regex pattern that we created (a raw string keeps the backslashes intact)
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Returns a list of all the strings that match the pattern
dates = re.findall(pattern, content)

for date in dates:
    # Split on whichever separator was used; the order is day, month, year
    day, month, year = map(int, re.split(r"[/-]", date))
    # Print only the dates whose day and month fall within a plausible range
    if 1 <= day <= 31 and 1 <= month <= 12:
        print(date)

Input Text

For example, suppose the content of the text file is as follows:

My name is XXX. I was born on 07/12/2001 in YYY city.
I graduated from ZZZ college on 07-28-2019.

Output

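07/12/2001

The second date, 07-28-2019, is not printed: the code reads it as day 7, month 28, and 28 fails the month check.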
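
The range check above still accepts impossible dates such as 31/02/2019. One way to tighten it, shown here as a minimal sketch rather than as part of the original example, is to let datetime.strptime() decide whether a match is a real calendar date. The helper name is_real_date and the sample text are assumptions made for illustration, and the day-month-year order is carried over from the code above.

```python
import re
from datetime import datetime

pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

def is_real_date(text):
    """Return True if text is a valid calendar date in day-month-year order.

    Hypothetical helper, not part of the original example.
    """
    # Normalise the separator so one format string covers both styles.
    normalised = text.replace("-", "/")
    try:
        datetime.strptime(normalised, "%d/%m/%Y")
        return True
    except ValueError:
        # strptime raises ValueError for out-of-range days or months.
        return False

# Hypothetical sample text, used only for illustration.
sample = "Valid: 07/12/2001. Impossible: 31/02/2019 and 40/32/2019."
valid_dates = [d for d in re.findall(pattern, sample) if is_real_date(d)]
print(valid_dates)  # ['07/12/2001']
```

This also takes leap years into account, which a plain numeric range check cannot do.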

Example 2

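This example finds a single date written as year-month-day and converts it into a date object with datetime.strptime().

import re
from datetime import datetime

s = "Jason's birthday is on 2002-07-28"

# Search for the first occurrence of four digits, two digits, two digits
match = re.search(r"\d{4}-\d{2}-\d{2}", s)

# Convert the matched text into a date object
birthday = datetime.strptime(match.group(), "%Y-%m-%d").date()
print(birthday)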

Output

2002-07-28
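
Example 2 assumes that the text contains a date: re.search() returns None when nothing matches, and calling group() on None raises an error. The following sketch, which uses a hypothetical helper named find_iso_dates, collects every year-month-day date in a string and returns an empty list when there is none.

```python
import re
from datetime import datetime

def find_iso_dates(text):
    """Return every valid YYYY-MM-DD date in text as date objects.

    Hypothetical helper, not part of the original example.
    """
    found = []
    # finditer yields one match object per occurrence, or nothing at all.
    for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text):
        try:
            found.append(datetime.strptime(m.group(), "%Y-%m-%d").date())
        except ValueError:
            # The text looked like a date but is not one, e.g. 2002-13-45.
            continue
    return found

print(find_iso_dates("Jason's birthday is on 2002-07-28"))
print(find_iso_dates("No date here"))  # []
```

Returning a list keeps the calling code simple, since an empty result needs no special handling.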

Conclusion

In this article, we used Python regular expressions to extract dates from a given text. We built a pattern for day, month, and year dates separated by '/' or '-', filtered out matches whose day or month is out of range, and converted a year-month-day match into a date object with datetime.strptime(). You may counter that alternative approaches, such as the split() function, result in faster execution and simpler code. However, split() alone cannot pick a date out of the surrounding text, which is why regex is my personal preference for this task.

Updated on: 02-Nov-2023
