Python Regex Cheat Sheet


Regular expressions, commonly known as regex, are powerful tools for pattern matching and text manipulation in Python programming. They allow you to search, extract, and modify text based on specific patterns, making them essential for tasks such as data validation, string manipulation, and text processing.

However, working with regular expressions can be challenging, especially for beginners or those who don't use them frequently. Remembering the syntax and understanding the various metacharacters and rules can be daunting.

To make your regex journey smoother, we have created a comprehensive Python Regex Cheat Sheet. This cheat sheet serves as a handy reference guide, providing you with a quick overview of the most commonly used metacharacters, character classes, quantifiers, anchors, groups, flags, escape sequences, and special characters in Python regex.

Note − Remember to import the re module in your Python script to work with regular expressions.

Commonly Used Metacharacters

Metacharacters are special characters in regex that carry a specific meaning and are used to define patterns. Understanding and utilizing these metacharacters is essential for effective pattern matching. In this section, we'll explore some of the most commonly used metacharacters in Python regex.

Dot (.) − The dot metacharacter matches any character except a newline. It is often used to represent a wildcard character, allowing you to match any character in a given position.

Example

import re

pattern = r"b.ttle"
text1 = "bottle"
text2 = "battle"
text3 = "bottle\n"

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)

print(match1) 
print(match2)  
print(match3)

Output

<re.Match object; span=(0, 6), match='bottle'>
<re.Match object; span=(0, 6), match='battle'>
<re.Match object; span=(0, 6), match='bottle'>

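In the above example, the dot metacharacter . matches any single character in the pattern b.ttle, so it matches both "bottle" and "battle". It also matches within "bottle\n": search() finds "bottle" at the start of the string, and the trailing newline is simply left out of the match. The dot fails only when a newline sits in the position the dot itself has to fill.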
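
To see the dot's newline behavior in isolation, here is a minimal sketch with a made-up string in which the newline sits exactly where the dot must match. It assumes nothing beyond the standard re module and also previews the re.DOTALL flag described later in this cheat sheet.

```python
import re

# Illustrative input: the newline occupies the position the dot has to match.
text = "b\nttle"

# By default the dot does not match "\n", so there is no match.
default_match = re.search(r"b.ttle", text)

# With re.DOTALL the dot matches any character, newline included.
dotall_match = re.search(r"b.ttle", text, re.DOTALL)

print(default_match)             # None
print(dotall_match is not None)  # True
```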

Caret (^) and Dollar Sign ($) − The caret and dollar sign metacharacters represent the start and end of the string respectively (or the start and end of each line when the re.MULTILINE flag is set). They are used to anchor patterns at the beginning or end of the text.

Example

import re

pattern1 = r"^Python"
pattern2 = r"\d$"
text1 = "Python is a powerful language"
text2 = "The price is $10"

match1 = re.search(pattern1, text1)
match2 = re.search(pattern2, text2)

print(match1)
print(match2)

Output

<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(15, 16), match='0'>

In the above example, the caret ^ is used to anchor the pattern "Python" at the beginning of the line, successfully matching it in text1. The dollar sign $ is used to anchor the pattern \d (which matches any digit) at the end of the line, successfully matching the digit "0" in text2.
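
The difference between anchoring to the string and anchoring to each line can be sketched as follows. The two-line sample text is invented for illustration, and re.MULTILINE is the standard flag listed in the flags section.

```python
import re

# Illustrative two-line input.
text = "Python first\nPython second"

# Without flags, ^ anchors only at the very start of the string.
single = re.findall(r"^Python", text)

# With re.MULTILINE, ^ also anchors right after each newline.
multi = re.findall(r"^Python", text, re.MULTILINE)

print(single)  # ['Python']
print(multi)   # ['Python', 'Python']
```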

Square Brackets ([...]) − Square brackets are used to create a character class, allowing you to match a single character from a set of specified characters. You can include multiple characters or ranges within the brackets.

Example

import re

pattern = r"[aeiou]"
text = "Hello, World!"

matches = re.findall(pattern, text)

print(matches)

Output

['e', 'o', 'o']

In the above example, the pattern [aeiou] is used to match any vowel character in the text. The findall() function returns a list of all matches found, which in this case are the characters 'e', 'o', and 'o'.
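
Ranges inside square brackets work the same way as listed characters. The following is a small sketch using an invented sample sentence; it shows one class per position and then several ranges sharing a single class.

```python
import re

# Illustrative input.
text = "Order A7 shipped to Bay 12"

# [A-Z] is a range of uppercase letters and [0-9] a range of digits:
# together they match one uppercase letter followed by one digit.
codes = re.findall(r"[A-Z][0-9]", text)

# Several ranges can share one class: any ASCII letter or digit.
alnum = re.findall(r"[A-Za-z0-9]+", text)

print(codes)  # ['A7']
print(alnum)  # ['Order', 'A7', 'shipped', 'to', 'Bay', '12']
```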

Pipe (|) − The pipe character is used as an OR operator, allowing you to match either the pattern on the left or the pattern on the right.

Example

import re

pattern = r"cat|dog"
text = "I have a cat and a dog"

match = re.search(pattern, text)

print(match)

Output

<re.Match object; span=(9, 12), match='cat'>

In the above example, the pattern cat|dog matches either "cat" or "dog". The search() function returns the first match found, which in this case is "cat".
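
As a follow-up sketch, findall() collects every alternative that occurs rather than only the first one. The second pattern uses a non-capturing group (?:...), which is not covered above, to limit the alternation to a single letter; the "gray or grey" string is made up for illustration.

```python
import re

text = "I have a cat and a dog"

# findall() returns every non-overlapping match, not just the first.
all_pets = re.findall(r"cat|dog", text)

# The pipe applies to everything on either side of it, so a group
# is needed to restrict the choice to one letter.
spellings = re.findall(r"gr(?:a|e)y", "gray or grey")

print(all_pets)   # ['cat', 'dog']
print(spellings)  # ['gray', 'grey']
```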

These are just a few examples of commonly used metacharacters in Python regex. In the next section, we'll explore character classes and quantifiers to further enhance our pattern matching capabilities.

Character Classes and Quantifiers

Character classes and quantifiers provide additional flexibility and control when defining regex patterns. In this section, we'll delve into these features and learn how to use them effectively.

Character Classes − Character classes allow you to specify a set of characters that can match at a particular position in the pattern. They are enclosed within square brackets [ ] and provide a way to match any single character from the defined set.

Example

import re

pattern = r"[aeiou]"
text = "Hello, World!"

matches = re.findall(pattern, text)

print(matches)

Output

['e', 'o', 'o']

In the above example, the character class [aeiou] matches any vowel character in the text. The findall() function returns a list of all matches found, which in this case are the characters 'e', 'o', and 'o'.

Negated Character Classes − Negated character classes allow you to match any character that is not in the defined set. They are denoted by including a caret ^ at the beginning of the character class.

Example

import re

pattern = r"[^aeiou]"
text = "Hello, World!"

matches = re.findall(pattern, text)

print(matches)

Output

['H', 'l', 'l', ',', ' ', 'W', 'r', 'l', 'd', '!']

In the above example, the negated character class [^aeiou] matches any character that is not a lowercase vowel. The findall() function returns a list of all matches found, which includes the consonants as well as the comma, the space, and the exclamation mark.
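
A common use of a negated class is stripping unwanted characters. The sketch below assumes a made-up phone number string and uses re.sub() to delete everything that is not a digit.

```python
import re

# Illustrative input: a phone number with punctuation and spaces.
raw = "+1 (555) 010-9999"

# [^0-9] matches any character that is not a digit;
# replacing each one with "" leaves only the digits.
digits_only = re.sub(r"[^0-9]", "", raw)

print(digits_only)  # 15550109999
```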

Quantifiers − Quantifiers allow you to specify the number of occurrences of a pattern that should be matched. They can be applied to individual characters, character classes, or groups of patterns −

* − Matches zero or more occurrences of the preceding pattern.
+ − Matches one or more occurrences of the preceding pattern.
? − Matches zero or one occurrence of the preceding pattern.
{n} − Matches exactly n occurrences of the preceding pattern.
{n,} − Matches at least n occurrences of the preceding pattern.
{n,m} − Matches between n and m occurrences of the preceding pattern.

Example

import re

pattern = r"ab*c"
text1 = "ac"
text2 = "abc"
text3 = "abbbbc"

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)

print(match1)  
print(match2)  
print(match3)

Output

<re.Match object; span=(0, 2), match='ac'>
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 6), match='abbbbc'>

In the above example, the quantifier * is used to match zero or more occurrences of the character 'b' in the pattern ab*c. It successfully matches "ac", "abc", and "abbbbc" as the 'b' character is optional.
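
The remaining quantifiers from the list can be compared side by side. This is a minimal sketch on an invented string of "a" followed by a varying number of "b" characters; the \b word boundary (covered later) is used only to keep each match aligned to a whole word.

```python
import re

# Illustrative input: words with zero to four trailing "b" characters.
text = "a ab abb abbb abbbb"

exactly_two = re.findall(r"\bab{2}\b", text)    # {n}
at_least_two = re.findall(r"\bab{2,}\b", text)  # {n,}
one_to_two = re.findall(r"\bab{1,2}\b", text)   # {n,m}
one_or_more = re.findall(r"\bab+\b", text)      # +
zero_or_one = re.findall(r"\bab?\b", text)      # ?

print(exactly_two)   # ['abb']
print(at_least_two)  # ['abb', 'abbb', 'abbbb']
print(one_to_two)    # ['ab', 'abb']
print(one_or_more)   # ['ab', 'abb', 'abbb', 'abbbb']
print(zero_or_one)   # ['a', 'ab']
```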

By combining character classes, negated character classes, and quantifiers, you can create powerful and flexible regex patterns to match various patterns within a given text.

In the next section, we'll explore more advanced features of Python regex, including capturing groups, anchors, and lookaheads.

Capturing Groups, Anchors, and Lookaheads

Capturing groups, anchors, and lookaheads are advanced features of Python regex that provide more control over pattern matching. In this section, we'll explore these features and understand how to use them effectively.

Capturing Groups − Capturing groups allow you to define subpatterns within a larger pattern and extract the matched content. They are defined using parentheses ( ) and are useful when you want to extract specific parts of a match.

Example

import re

pattern = r"(\d{2})-(\d{2})-(\d{4})"
text = "Date of Birth: 23-01-1990"

match = re.search(pattern, text)

if match:
    day = match.group(1)
    month = match.group(2)
    year = match.group(3)
    print(f"Day: {day}, Month: {month}, Year: {year}")

Output

Day: 23, Month: 01, Year: 1990

In the above example, the pattern (\d{2})-(\d{2})-(\d{4}) defines three capturing groups to match the day, month, and year of a date written as DD-MM-YYYY. The search() function returns a match object, and the group() method extracts each captured value by its group number, counting opening parentheses from left to right.
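
Besides group(n), a match object offers group(0) for the whole match and groups() for all captured values at once. The sketch below uses an invented log line to show both.

```python
import re

# Illustrative input.
text = "Backup started at 09:41:07"

match = re.search(r"(\d{2}):(\d{2}):(\d{2})", text)

if match:
    whole = match.group(0)   # the entire matched text
    parts = match.groups()   # a tuple of every captured group
    hours, minutes, seconds = parts
    print(whole)  # 09:41:07
    print(parts)  # ('09', '41', '07')
```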

Anchors − Anchors are used to specify the position in the text where a match should occur. They do not match any characters but rather assert a condition about the surrounding text. Two commonly used anchors are ^ and $.

^ − Matches the start of a string.
$ − Matches the end of a string.

Example

import re

pattern = r"^Python"
text = "Python is a popular programming language"

match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")

Output

Match found!

In the above example, the pattern ^Python matches the word "Python" only if it occurs at the beginning of the text. Since the text starts with "Python", a match is found, and the corresponding message is printed.
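
Using both anchors together is a common way to validate a whole string. The following sketch assumes a hypothetical rule (a code of exactly five digits) and also shows one subtlety: $ matches just before a single trailing newline, whereas re.fullmatch() requires the entire string to match.

```python
import re

# Hypothetical rule for illustration: a valid code is exactly five digits.
pattern = r"^\d{5}$"

anchored_ok = re.search(pattern, "12345")
anchored_extra = re.search(pattern, "12345 Main St")
unanchored = re.search(r"\d{5}", "12345 Main St")

# $ also matches just before one trailing newline; fullmatch() does not allow it.
with_newline = re.search(pattern, "12345\n")
strict = re.fullmatch(r"\d{5}", "12345\n")

print(bool(anchored_ok))     # True
print(bool(anchored_extra))  # False
print(bool(unanchored))      # True
print(bool(with_newline))    # True
print(bool(strict))          # False
```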

Lookaheads − Lookaheads assert that a given pattern must (or must not) come immediately after the current position, without making it part of the match. They are denoted by (?=...) for positive lookaheads and (?!...) for negative lookaheads.

Example

import re

pattern = r"\b\w+(?=ing\b)"
text = "Walking is good for health"

matches = re.findall(pattern, text)

print(matches)

Output

['Walk']

In the above example, the pattern \b\w+(?=ing\b) matches the part of a word that comes before a final "ing". The positive lookahead (?=ing\b) asserts that the matched text is followed by "ing" and a word boundary, but "ing" is not part of the actual match. The findall() function returns a list of all matches, which in this case is ['Walk'].
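
A negative lookahead can be sketched in the same way. The sample sentence is invented; the \b after \d+ matters because without it the engine could backtrack and match only part of a number.

```python
import re

# Illustrative input.
text = "50% of 200 items cost 30 USD"

# Positive lookahead: digits that are followed by " USD".
usd_amounts = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers that are NOT followed by "%".
not_percent = re.findall(r"\b\d+\b(?!%)", text)

print(usd_amounts)  # ['30']
print(not_percent)  # ['200', '30']
```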

By utilizing capturing groups, anchors, and lookaheads, you can create more sophisticated regex patterns to precisely match and extract specific content within a text.

In the next section, we'll explore additional advanced features of Python regex, including backreferences, flags, and advanced modifiers.

Backreferences, Flags, and Advanced Modifiers

Backreferences, flags, and advanced modifiers are powerful features of Python regex that enhance pattern matching capabilities. In this section, we'll delve into these features and learn how to leverage them effectively.

Backreferences − Backreferences allow you to refer to previously captured groups within the pattern. A numbered group is referenced with a backslash followed by its number (\1, \2, and so on), and a named group is referenced with (?P=name). Backreferences are useful when you want to match repetitive patterns or ensure consistency in the matched content.

Example

import re

pattern = r"(\w+)\s+\1"
text = "hello hello"

match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")

Output

Match found!

In the above example, the pattern (\w+)\s+\1 matches a word followed by one or more whitespaces and then the same word again. The backreference \1 refers to the first captured group, which ensures that the same word is repeated. Since the text contains "hello hello", a match is found, and the corresponding message is printed.
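
Backreferences are also available in the replacement string of re.sub(). As a rough sketch on an invented sentence, the code below collapses doubled words and then shows the named form (?P=word), where "word" is just an illustrative group name.

```python
import re

# Illustrative input with two accidentally doubled words.
text = "This is is a test test sentence"

# \1 in the pattern requires the same word twice;
# \1 in the replacement keeps a single copy of it.
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

# The same idea with a named group and a named backreference.
named = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", text)

print(deduped)              # This is a test sentence
print(named.group("word"))  # is
```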

Flags − Flags modify the behavior of the regex pattern matching. They are defined as constants in the re module and can be passed as an optional argument to the regex functions. Some commonly used flags are −

re.IGNORECASE − Ignore case when matching.
re.MULTILINE − Make ^ and $ match at the start and end of each line, not just of the whole string.
re.DOTALL − Allow the dot (.) to match any character, including newline.

Example

import re

pattern = r"python"
text = "Python is a popular programming language"

match = re.search(pattern, text, re.IGNORECASE)

if match:
    print("Match found!")
else:
    print("No match found.")

Output

Match found!

In the above example, the pattern python is matched against the text with the re.IGNORECASE flag. As a result, the case difference is ignored, and a match is found despite the word "Python" starting with an uppercase letter.
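
Several flags can be combined with the bitwise OR operator (|). The sketch below assumes a made-up three-line log and uses re.compile() so the flags are attached to the pattern once.

```python
import re

# Illustrative three-line input.
text = "error: disk full\nWarning: low memory\nERROR: timeout"

# IGNORECASE makes "error" match "ERROR" too;
# MULTILINE lets ^ match at the start of every line.
pattern = re.compile(r"^error", re.IGNORECASE | re.MULTILINE)

matches = pattern.findall(text)

print(matches)  # ['error', 'ERROR']
```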

Advanced Modifiers − These modifiers are the quantifiers introduced earlier, applied here to optional parts of a pattern. In Python they are written inside the pattern string, directly after the character, character class, or group they apply to; Python patterns have no closing delimiter.

? − Makes the preceding pattern optional (matches 0 or 1 occurrence).
* − Matches 0 or more occurrences of the preceding pattern.
+ − Matches 1 or more occurrences of the preceding pattern.
{m} − Matches exactly m occurrences of the preceding pattern.
{m,n} − Matches between m and n occurrences of the preceding pattern.

Example

import re

pattern = r"apples?|bananas?"
text = "I like apple and bananas"

matches = re.findall(pattern, text)

print(matches)

Output

['apple', 'bananas']

In the above example, the pattern apples?|bananas? matches either "apple" or "apples" and "banana" or "bananas". The ? modifier makes the preceding character or group optional, allowing for matching both singular and plural forms of the fruit names.

By using backreferences, flags, and advanced modifiers, you can create more flexible and dynamic regex patterns to handle various matching scenarios.

In the next section, we'll discuss common regex pitfalls and best practices to improve your regex skills.

Common Regex Pitfalls and Best Practices

While regular expressions are powerful tools for pattern matching, they can also be prone to pitfalls if not used correctly. In this section, we'll explore some common pitfalls and provide best practices to help you avoid them.

Greedy vs. Non-Greedy Matching − One common pitfall is the greedy matching behavior of regex, where patterns match as much as possible. This can lead to unintended results, especially when using quantifiers like * and +. To mitigate this, you can use the non-greedy modifiers *? and +? to match as little as possible.

Example

import re

text = "<html><body><h1>Title</h1></body></html>"

pattern = r"<.*?>"
matches = re.findall(pattern, text)

print(matches)

Output

['<html>', '<body>', '<h1>', '</h1>', '</body>', '</html>']

In the above example, the pattern <.*?> matches HTML tags. The .*? non-greedy modifier ensures that the match stops at the first occurrence of >. Without the non-greedy modifier, the match would span across the entire text, including multiple tags.
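
For contrast, here is a sketch of what the greedy version returns on the same text, along with a negated character class as an alternative way to stop at the first >. The alternative is offered as one possible approach, not the only one.

```python
import re

text = "<html><body><h1>Title</h1></body></html>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", text)

# Alternative to .*?: match anything except ">" between the brackets.
negated = re.findall(r"<[^>]*>", text)

print(greedy)   # ['<html><body><h1>Title</h1></body></html>']
print(negated)  # ['<html>', '<body>', '<h1>', '</h1>', '</body>', '</html>']
```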

Anchoring Matches − Anchoring matches can prevent unintended matches at unexpected positions within the text. Besides the anchors that mark the start (^) and end ($) of a line or the entire text, the word boundary \b anchors a match at the edge of a word.

Example

import re

text = "The quick brown fox jumps over the lazy dog."

pattern = r"\bfox\b"
matches = re.findall(pattern, text)

print(matches)  # Output: ['fox']

Output

['fox']

In the above example, the pattern \bfox\b matches the word "fox" as a whole word. The \b anchor ensures that "fox" is not matched as part of another word, like "foxy" or "foxes".
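
The effect of \b is easier to see on text that actually contains the longer words. This sketch uses an invented sentence and compares the bounded pattern with the unbounded one.

```python
import re

# Illustrative input containing "fox" inside longer words.
text = "A fox met two foxes and a foxy firefox."

# With word boundaries, only the standalone word matches.
whole_word = re.findall(r"\bfox\b", text)

# Without them, every occurrence of the letters "fox" matches.
anywhere = re.findall(r"fox", text)

print(whole_word)     # ['fox']
print(len(anywhere))  # 4
```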

Complicated Nested Patterns − When dealing with complex patterns involving nested groups, it's important to use named groups and proper pattern organization for readability and maintainability.

Example

import re

text = "Date: 2022-01-01, Time: 12:00 PM"

pattern = r"Date: (?P<date>\d{4}-\d{2}-\d{2}), Time: (?P<time>\d{2}:\d{2} [AP]M)"
match = re.search(pattern, text)

if match:
    date = match.group("date")
    time = match.group("time")
    print(f"Date: {date}, Time: {time}")

Output

Date: 2022-01-01, Time: 12:00 PM

In the above example, the pattern uses named groups (?P<name>pattern) to capture the date and time information. This approach improves code readability and allows easy access to the captured values using the group name.
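
One way to organize a longer pattern is the re.VERBOSE flag, which is not covered above: it ignores unescaped whitespace in the pattern and allows # comments. The sketch below rewrites the same pattern under that assumption (spaces are written as \s so they still count) and reads all named groups at once with groupdict().

```python
import re

text = "Date: 2022-01-01, Time: 12:00 PM"

# In VERBOSE mode, literal spaces must be written explicitly (here as \s).
pattern = re.compile(
    r"""
    Date:\s(?P<date>\d{4}-\d{2}-\d{2})   # date as YYYY-MM-DD
    ,\sTime:\s
    (?P<time>\d{2}:\d{2}\s[AP]M)         # 12-hour clock time
    """,
    re.VERBOSE,
)

match = pattern.search(text)
if match:
    fields = match.groupdict()  # maps each group name to its captured text
    print(fields)  # {'date': '2022-01-01', 'time': '12:00 PM'}
```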

Conclusion

Regular expressions are a powerful tool for pattern matching and text manipulation in Python. By understanding the basic syntax, metacharacters, and common regex techniques, you can unlock a wide range of possibilities for working with text data.

Mrudgandha Kulkarni

Updated on: 10-Aug-2023
