Python - Regular Expressions

A regular expression, often abbreviated as "regex" or "regexp", is a sequence of characters used to search for specific patterns within text. It provides a specialized syntax that lets you define rules for matching strings or sets of strings.

These patterns are essential for various operations in string processing such as finding occurrences of patterns, replacing text based on rules and validating the format of input data.

In data science and other fields that require extensive text manipulation, regular expressions play a crucial role. Many programming languages support them, including Python, whose standard library provides the `re` module for working with regular expressions. Before working with regular expressions, it is important to understand raw strings.

Raw Strings

In Python, raw strings are string literals prefixed with the character 'r'. This prefix tells Python to treat backslashes within the string as literal characters rather than as the start of an escape sequence.

Put simply, a string becomes a raw string when r or R precedes the opening quotation mark. Here "Hello" is a normal string and r"Hello" is a raw string.

>>> normal="Hello"
>>> print (normal)
Hello
>>> raw=r"Hello"
>>> print (raw)
Hello

Under normal circumstances, there is no difference between the two. However, when a backslash escape sequence is embedded in the string, the normal string interprets it, whereas the raw string leaves it unprocessed.

In the example below, printing the normal string processes the escape sequence '\n' and introduces a newline. With the raw string prefix 'r', however, the escape sequence is not interpreted, so the backslash and the letter n are printed as they are.

>>> normal="Hello\nWorld"
>>> print (normal)
Hello
World
>>> raw=r"Hello\nWorld"
>>> print (raw)
Hello\nWorld

Example

The following example uses a raw string in a Python regular expression. Here we find a date using the pattern defined in the raw string.

import re

# Regular expression to match a date format 'YYYY-MM-DD'
pattern = r'\d{4}-\d{2}-\d{2}'

# Text containing a date
text = 'Event on 2018-06-12'

# Using re.findall to find all occurrences of the date pattern
matches = re.findall(pattern, text)
print(matches)  # Output: ['2018-06-12']

Output

['2018-06-12']
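The difference matters most for sequences that mean one thing to Python and another to the regex engine. The short sketch below uses \b as an illustration (the sample text and variable names are made up for this example): in a normal string Python turns '\b' into a backspace character before the re module ever sees it, while in a raw string it reaches the regex engine and is read as a word boundary.

```python
import re

text = "the cat sat"

# Normal string: Python converts each '\b' to a backspace character (ASCII 8),
# so this pattern looks for backspace + 'cat' + backspace.
normal_pattern = '\bcat\b'

# Raw string: the backslashes are kept, and the regex engine
# interprets \b as a word boundary.
raw_pattern = r'\bcat\b'

normal_result = re.findall(normal_pattern, text)
raw_result = re.findall(raw_pattern, text)

print(len(normal_pattern), len(raw_pattern))  # 5 7
print(normal_result)                          # []
print(raw_result)                             # ['cat']
```

For this reason it is a common habit to write every regex pattern as a raw string, even when it contains no backslashes.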

Metacharacters

Metacharacters in regular expressions (regex) are special characters with a predefined meaning and behavior. They allow us to construct complex patterns for searching, matching, and manipulating text.

Here's a complete list of the metacharacters −

. ^ $ * + ? { } [ ] \ | ( )

The square bracket symbols [ and ] specify a set of characters that you wish to match. Characters can be listed individually, or as a range by separating the first and last character with a '-'.

[abc]	matches any of the characters a, b, or c.
[a-c]	uses a range to express the same set of characters.
[a-z]	matches only lowercase letters.
[0-9]	matches only digits.
[^5]	a leading '^' complements the character set; [^5] matches any character except '5'.

'\' is the escaping metacharacter. Followed by various characters, it forms various special sequences. If you need to match a [ or a \, you can precede it with a backslash to remove its special meaning: \[ or \\.
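As a small illustration of escaping, the sketch below matches literal brackets, a literal backslash, and a literal period. The sample text is invented for the example. It also shows re.escape(), a standard re function that inserts the backslashes for you when the text to match is not known in advance.

```python
import re

# Raw string, so the backslash in C:\temp is a real backslash character.
text = r"Use [tag] or C:\temp (v1.5)"

# \[ and \] match literal square brackets.
brackets = re.findall(r'\[\w+\]', text)

# \\ matches one literal backslash.
backslashes = re.findall(r'\\', text)

# \. matches a literal period; an unescaped . would match any character.
versions = re.findall(r'\d\.\d', text)

# re.escape() escapes every metacharacter in an arbitrary string.
literal = re.findall(re.escape("(v1.5)"), text)

print(brackets)          # ['[tag]']
print(len(backslashes))  # 1
print(versions)          # ['1.5']
print(literal)           # ['(v1.5)']
```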

The predefined special sequences beginning with '\', together with the other commonly used metacharacters, are listed below −

\d	Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
\D	Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
\s	Matches any whitespace character; for ASCII text this is equivalent to the class [ \t\n\r\f\v].
\S	Matches any non-whitespace character; for ASCII text this is equivalent to the class [^ \t\n\r\f\v].
\w	Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
\W	Matches any non-alphanumeric character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].
.	Matches any single character except newline '\n'.
?	Matches 0 or 1 occurrence of the pattern to its left.
+	Matches 1 or more occurrences of the pattern to its left.
*	Matches 0 or more occurrences of the pattern to its left.
\b	Matches the boundary between a word character and a non-word character; \B is the opposite of \b.
[..]	Matches any single character in the square brackets; [^..] matches any single character not in the square brackets.
\	Removes the special meaning of the character that follows, for example \. to match a period or \+ to match a plus sign.
{n,m}	Matches at least n and at most m occurrences of the preceding pattern.
a|b	Matches either a or b.
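A few of the entries above can be tried out directly. The following sketch uses an invented sample sentence and shows the +, {n,m}, ? and | metacharacters with re.findall(); the variable names are only for illustration.

```python
import re

text = "Order 7, order 42, order 12345; colour or color?"

# + : one or more digits in a row.
numbers = re.findall(r'\d+', text)

# {n,m} : two or three digits, with \b on both sides so that
# part of a longer number such as 12345 does not count.
two_or_three = re.findall(r'\b\d{2,3}\b', text)

# ? : the 'u' is optional, so both spellings match.
spellings = re.findall(r'colou?r', text)

# | : either alternative matches.
either = re.findall(r'Order|order', text)

print(numbers)       # ['7', '42', '12345']
print(two_or_three)  # ['42']
print(spellings)     # ['colour', 'color']
print(either)        # ['Order', 'order', 'order']
```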

Example

The following example shows the use of metacharacters in Python regular expressions −

import re

# Example text
text = "The learning starts here. Welcome to Tutorialspoint and have a happy learning."

# Using a regex pattern to find the whole word 'The' or 'the'
pattern = r'\b[Tt]he\b'

# Finding all occurrences of the pattern in the text
matches = re.findall(pattern, text)

print(matches)

Output

['The']

Python's re module provides useful functions for matching a pattern at the start of a string, searching for a pattern anywhere in it, substituting a matched string with another string, and so on.

re.match() Function

This function attempts to match the RE pattern at the start of the string, with optional flags.

Here is the syntax for this function −

re.match(pattern, string, flags=0)

Here is the description of the parameters −

Sr.No.	Parameter & Description
1	pattern This is the regular expression to be matched.
2	string This is the string, which would be searched to match the pattern at the beginning of string.
3	flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the Flags table of the re.compile() section.

The re.match() function returns a match object on success and None on failure. A match object contains information about the match: where it starts and ends, the substring it matched, and so on.

The match object's start() method returns the starting position of the match in the string, and end() returns the position just past its last character.

If the pattern is not found, re.match() returns None, so check the result before calling any of its methods.

We use the group(num) or groups() method of the match object to get the matched expression.

Match Object Methods	Description
group(num=0)	Returns the entire match (or the specific subgroup num).
groups()	Returns all matching subgroups in a tuple (empty if there weren't any).

Example

import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'Cats', line)
print (matchObj.start(), matchObj.end())
print ("matchObj.group() : ", matchObj.group())

Output

0 4
matchObj.group() :  Cats
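The group(num) and groups() methods become more useful once the pattern contains parenthesized subgroups. The sketch below reuses the same sentence with a pattern made up for illustration that captures the first and last word.

```python
import re

line = "Cats are smarter than dogs"

# Two parenthesized subgroups: the word before 'are' and the word after 'than'.
match_obj = re.match(r'(\w+) are .* than (\w+)', line)

# re.match() returns None when nothing matches, so test the result first.
if match_obj:
    print(match_obj.group())   # Cats are smarter than dogs
    print(match_obj.group(1))  # Cats
    print(match_obj.group(2))  # dogs
    print(match_obj.groups())  # ('Cats', 'dogs')
```

Subgroups are numbered from 1 in the order of their opening parentheses; group(0), or group() with no argument, is always the entire match.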

re.search() Function

This function searches for the first occurrence of the RE pattern within the string, with optional flags.

Here is the syntax for this function −

re.search(pattern, string, flags=0)

Here is the description of the parameters −

Sr.No.	Parameter & Description
1	pattern This is the regular expression to be matched.
2	string This is the string, which would be searched to match the pattern anywhere in the string.
3	flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the Flags table of the re.compile() section.

The re.search() function returns a match object on success and None on failure. We use the group(num) or groups() method of the match object to get the matched expression.

Match Object Methods	Description
group(num=0)	Returns the entire match (or the specific subgroup num).
groups()	Returns all matching subgroups in a tuple (empty if there weren't any).

Example

import re
line = "Cats are smarter than dogs"
matchObj = re.search( r'than', line)
print (matchObj.start(), matchObj.end())
print ("matchObj.group() : ", matchObj.group())

Output

17 21
matchObj.group() :  than

Matching Vs Searching

Python offers two different primitive operations based on regular expressions: match and search. match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (which is what Perl does by default).

Example

import re
line = "Cats are smarter than dogs"

# match() looks only at the start of the string
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print ("match --> matchObj.group() : ", matchObj.group())
else:
    print ("No match!!")

# search() scans the whole string
searchObj = re.search(r'dogs', line, re.M|re.I)
if searchObj:
    print ("search --> searchObj.group() : ", searchObj.group())
else:
    print ("Nothing found!!")

When the above code is executed, it produces the following output −

No match!!
search --> searchObj.group() :  dogs

re.findall() Function

The findall() function returns all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Syntax

re.findall(pattern, string, flags=0)

Parameters

Sr.No.	Parameter & Description
1	pattern This is the regular expression to be matched.
2	string This is the string, which would be searched to match the pattern anywhere in the string.
3	flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the Flags table of the re.compile() section.

Example

import re
string="Simple is better than complex."
obj=re.findall(r"ple", string)
print (obj)

Output

['ple', 'ple']

Example

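The following code obtains the words in a sentence with the help of the findall() function. Because \w* can also match zero characters, the result contains an empty string at every position where no word character is found; using \w+ instead would return only the words.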

import re
string="Simple is better than complex."
obj=re.findall(r"\w*", string)
print (obj)

Output

['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
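Two details of findall() are worth a quick sketch: a pattern that cannot match the empty string avoids the '' entries shown above, and a pattern with more than one group makes findall() return a list of tuples. The key=value text in the second call is an invented sample.

```python
import re

string = "Simple is better than complex."

# \w+ needs at least one word character, so no empty strings are returned.
words = re.findall(r"\w+", string)

# With two groups in the pattern, each result is a tuple with one item per group.
pairs = re.findall(r"(\w+)=(\d+)", "width=80 height=24")

print(words)  # ['Simple', 'is', 'better', 'than', 'complex']
print(pairs)  # [('width', '80'), ('height', '24')]
```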

re.sub() Function

One of the most important functions in the re module is sub().

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

This function replaces occurrences of the RE pattern in string with repl. All occurrences are replaced unless count is given, in which case at most count replacements are made. It returns the modified string.

Example

import re
phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print ("Phone Num : ", num)

Output

Phone Num :  2004-959-559
Phone Num :  2004959559

Example

The following example uses the sub() function to replace all occurrences of the word is with was −

import re
string="Simple is better than complex. Complex is better than complicated."
obj=re.sub(r'is', r'was',string)
print (obj)

Output

Simple was better than complex. Complex was better than complicated.
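The pattern r'is' works in this sentence only because "is" never appears inside another word. The sketch below, using a made-up sentence, shows how \b restricts the replacement to whole words, how the count argument limits the number of replacements, and how \1, \2, \3 in repl refer back to captured groups.

```python
import re

string = "This is it. Simple is better."

# Without \b, the 'is' inside 'This' is replaced as well.
naive = re.sub(r'is', 'was', string)

# \b on both sides limits the match to the whole word 'is'.
whole_word = re.sub(r'\bis\b', 'was', string)

# count=1 replaces only the first occurrence.
first_only = re.sub(r'\bis\b', 'was', string, count=1)

# Backreferences in repl reuse the captured groups, here to reorder a date.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2018-06-12')

print(naive)       # Thwas was it. Simple was better.
print(whole_word)  # This was it. Simple was better.
print(first_only)  # This was it. Simple is better.
print(swapped)     # 12/06/2018
```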

re.compile() Function

The compile() function compiles a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

Syntax

re.compile(pattern, flags=0)

Flags

Sr.No.	Modifier & Description
1	re.I Performs case-insensitive matching.
2	re.L Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns.
3	re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
4	re.S Makes a period (dot) match any character, including a newline.
5	re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for string patterns.
6	re.X Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.

Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Example

import re
string="Simple is better than complex. Complex is better than complicated."
pattern=re.compile(r'is')

# match() tries only the start of the string, so it returns None here
obj=pattern.match(string)

# search() finds the first occurrence anywhere in the string
obj=pattern.search(string)
print (obj.start(), obj.end())

obj=pattern.findall(string)
print (obj)

obj=pattern.sub(r'was', string)
print (obj)

It will produce the following output −

7 9
['is', 'is']
Simple was better than complex. Complex was better than complicated.
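The flags in the table above can be passed to re.compile() as well as to the module-level functions. The following sketch uses an invented two-line string to show the effect of re.I, re.M, re.S and re.X, and how flags are combined with bitwise OR; all variable names are only for illustration.

```python
import re

text = "first line\nSecond LINE"

# re.I: 'line' also matches 'LINE'.
ignore_case = re.findall(r'line', text, re.I)

# re.M: ^ matches at the start of every line, not just the start of the string.
single_line = re.findall(r'^\w+', text)
multi_line = re.findall(r'^\w+', text, re.M)

# re.S: the dot also matches '\n', so the pattern can span both lines.
without_dotall = re.search(r'first.*LINE', text)
with_dotall = re.search(r'first.*LINE', text, re.S)

# Flags are combined with bitwise OR.
combined = re.findall(r'^second', text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -(\d{2})  # month
    -(\d{2})  # day
""", re.X)
date_parts = date_pattern.search("Event on 2018-06-12").groups()

print(ignore_case)              # ['line', 'LINE']
print(single_line, multi_line)  # ['first'] ['first', 'Second']
print(without_dotall)           # None
print(with_dotall is not None)  # True
print(combined)                 # ['Second']
print(date_parts)               # ('2018', '06', '12')
```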

re.finditer() Function

This function returns an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.

Syntax

re.finditer(pattern, string, flags=0)

Example

import re
string="Simple is better than complex. Complex is better than complicated."
pattern=re.compile(r'is')
iterator = pattern.finditer(string)

# Each item yielded by the iterator is a match object
for match in iterator:
   print(match.span())

Output

(7, 9)
(39, 41)

Use Cases of Python Regex

Finding all Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner −

import re
text = "He was carefully disguised but captured quickly by police."
obj = re.findall(r"\w+ly\b", text)
print (obj)

Output

['carefully', 'quickly']

Finding words starting with vowels

import re
text = 'Errors should never pass silently. Unless explicitly silenced.'
obj=re.findall(r'\b[aeiouAEIOU]\w+', text)
print (obj)

Output

['Errors', 'Unless', 'explicitly']
