How to extract data from a string with Python Regular Expressions?
Extracting data from a string is a common task in Python. Regular expressions (regex) provide pattern matching that lets you locate and pull out specific parts of a string.
Python's built-in re module provides the regex functions. The ones most commonly used for extraction are re.search(), re.findall() and re.match().
Common Regular Expression Functions
| Function | Purpose | Returns |
|---|---|---|
| re.findall() | Find all matches | List of strings |
| re.search() | Find first match | Match object or None |
| re.match() | Match from start | Match object or None |
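The difference between the three functions is easiest to see when they run on the same input. The following is a minimal sketch; the sample string `text` is made up for illustration and does not come from the examples in this article.

```python
import re

# Hypothetical sample text, chosen only to contrast the three functions
text = "Order 66 shipped, order 77 pending"

# re.match() succeeds only if the pattern matches at the very start of the string
at_start = re.match(r"\d+", text)

# re.search() scans the whole string and stops at the first match
first = re.search(r"\d+", text)

# re.findall() returns every non-overlapping match as a list of strings
all_nums = re.findall(r"\d+", text)

print(at_start)        # None, because the text starts with a letter
print(first.group())   # 66
print(all_nums)        # ['66', '77']
```

Because re.search() and re.match() can return None, it is safer to check the result before calling group() on it.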
Extracting Digits from a String
The following example extracts numbers from a string using the \d+ pattern, which matches one or more consecutive digits:
import re
# Define your text here
txt = "My ID: 89456, Ref num: 7863"
# Extract all the numbers from the string using findall()
nums = re.findall(r"\d+", txt)
# Print the result
print("Extracted numbers:", nums)
print("Count of numbers found:", len(nums))
Output:
Extracted numbers: ['89456', '7863']
Count of numbers found: 2
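Note that findall() returns strings, not integers. If numeric values are needed, or only the number after a particular label is wanted, something like the following sketch may help. The variable names `values` and `user_id` are illustrative choices, not part of the original example.

```python
import re

txt = "My ID: 89456, Ref num: 7863"

# findall() returns strings; convert them when numeric values are needed
values = [int(n) for n in re.findall(r"\d+", txt)]

# A capture group (...) extracts only the digits that follow the "ID:" label
id_match = re.search(r"ID:\s*(\d+)", txt)
user_id = int(id_match.group(1)) if id_match else None

print(values)    # [89456, 7863]
print(user_id)   # 89456
```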
Extracting Email Addresses
Here we use the pattern \b[\w.-]+@[\w.-]+\.\w+\b to find email addresses in text. It matches common addresses such as username@domain.com, but it is a simplification and does not cover every valid address format:
import re
# Define your text here which contains email IDs
txt = "Contact us at contact@tutorialspoint.com or info@tutorix.com for support"
# Extract email IDs
emails = re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", txt)
# Print the result
print("Found emails:", emails)
for i, email in enumerate(emails, 1):
    print(f"Email {i}: {email}")
Output:
Found emails: ['contact@tutorialspoint.com', 'info@tutorix.com']
Email 1: contact@tutorialspoint.com
Email 2: info@tutorix.com
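If the user name and the domain are needed separately, capture groups can split each match. The sketch below wraps the two halves of the same pattern in parentheses; when a pattern contains more than one group, findall() returns a list of tuples instead of a list of strings. The name `pairs` is an illustrative choice.

```python
import re

txt = "Contact us at contact@tutorialspoint.com or info@tutorix.com for support"

# Group 1 is the part before "@", group 2 is the domain after it.
# With two groups in the pattern, findall() returns one tuple per match.
pairs = re.findall(r"\b([\w.-]+)@([\w.-]+\.\w+)\b", txt)

for user, domain in pairs:
    print(f"user={user}, domain={domain}")
```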
Extracting Hashtags
Hashtags are widely used on social media platforms. The pattern #\w+ matches a # symbol followed by one or more word characters:
import re
# Define your text here which contains hashtags
txt = "Latest trending topics are: #Python #Coding #AI #MachineLearning"
# Extract hashtags using the findall() method
tags = re.findall(r"#\w+", txt)
# Print the result
print("Hashtags found:", tags)
print("Total hashtags:", len(tags))
Output:
Hashtags found: ['#Python', '#Coding', '#AI', '#MachineLearning']
Total hashtags: 4
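When the position of each hashtag matters, or the leading # should be dropped, re.finditer() is an alternative to findall(). It yields Match objects rather than plain strings. The following is a small sketch; `names` and `positions` are illustrative variable names.

```python
import re

txt = "Latest trending topics are: #Python #Coding #AI #MachineLearning"

names = []
positions = []
# finditer() yields one Match object per hashtag, in order of appearance
for m in re.finditer(r"#(\w+)", txt):
    names.append(m.group(1))      # tag text without the leading "#"
    positions.append(m.start())   # index of the "#" in the original string

print(names)
print(positions)
```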
Extracting Dates
The pattern \d{4}-\d{2}-\d{2} matches dates in YYYY-MM-DD format. It looks for four digits, a dash, two digits, a dash, and two digits. It checks only the format, so an invalid date such as 2024-99-99 would also match:
import re
# Define your text here which contains some dates
txt = "Important events: 2023-08-15, 2025-05-29, 2024-12-01 are scheduled"
# Extract dates using the findall() method
dates = re.findall(r"\d{4}-\d{2}-\d{2}", txt)
# Print the result
print("Dates found:", dates)
for date in dates:
    year, month, day = date.split('-')
    print(f"Year: {year}, Month: {month}, Day: {day}")
Output:
Dates found: ['2023-08-15', '2025-05-29', '2024-12-01']
Year: 2023, Month: 08, Day: 15
Year: 2025, Month: 05, Day: 29
Year: 2024, Month: 12, Day: 01
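Named groups are an alternative to splitting each match on the dash, and the standard datetime module can confirm that a match is a real calendar date. This is one possible sketch rather than the only approach; `date_re` and `parsed` are illustrative names.

```python
import re
from datetime import datetime

txt = "Important events: 2023-08-15, 2025-05-29, 2024-12-01 are scheduled"

# Named groups make each component accessible by name instead of by position
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

parsed = []
for m in date_re.finditer(txt):
    try:
        # strptime() raises ValueError for impossible dates such as month 13
        parsed.append(datetime.strptime(m.group(0), "%Y-%m-%d").date())
    except ValueError:
        continue  # right shape, but not a real date: skip it
    print(m.group("year"), m.group("month"), m.group("day"))
```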
Pattern Syntax Summary
| Pattern | Meaning | Example |
|---|---|---|
| \d+ | One or more digits | 123, 45 |
| \w+ | One or more word characters | Python, AI |
| \b | Word boundary | Start/end of word |
| . | Any character except newline | a, 1, @ |
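The effect of \b and . is less obvious than that of \d+ and \w+, so a short demonstration may be useful. The sample strings below are made up for illustration.

```python
import re

# Hypothetical sample: "cat" appears alone, inside "concat", and inside "cat9"
sample = "cat concat cat9"

# \b requires a word boundary on both sides, so only the standalone word matches
whole = re.findall(r"\bcat\b", sample)

# Without \b, every occurrence of the three letters matches
loose = re.findall(r"cat", sample)

# . matches any single character except a newline, so "a\n" is not matched
dots = re.findall(r"a.", "a1 a@ a\n")

print(whole)   # ['cat']
print(loose)   # ['cat', 'cat', 'cat']
print(dots)    # ['a1', 'a@']
```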
Conclusion
Regular expressions provide powerful pattern matching for data extraction. Use re.findall() to extract all matches from text, and choose the pattern to fit the data: \d+ for numbers, #\w+ for hashtags, and \d{4}-\d{2}-\d{2} for dates in YYYY-MM-DD format.
