Why do we use the question mark in Python regular expressions?


Introduction

The question mark makes the preceding token in the regular expression optional. For example, colou?r matches both colour and color. The question mark is called a quantifier.
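As a small illustration of the optional token, the sketch below checks a few spellings against colou?r with re.fullmatch. The word list and the variable names are chosen here for illustration only.

```python
import re

# "u?" makes the letter u optional: zero or one occurrence is accepted.
pattern = re.compile(r"colou?r")

words = ["color", "colour", "colouur", "colr"]
results = {word: pattern.fullmatch(word) is not None for word in words}

for word, matched in results.items():
    print(f"{word!r}: {matched}")
```

Only the spellings with zero or one u are accepted; two u characters or a missing o are rejected.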

You can make several tokens optional by grouping them in parentheses and placing the question mark after the closing parenthesis. For example, Nov(ember)? matches both Nov and November.
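A minimal sketch of the optional group, assuming the pattern Nov(ember)? from the text. It also shows that in Python the optional group reports None when it did not take part in the match; the variable names are illustrative.

```python
import re

month = re.compile(r"Nov(ember)?")

short = month.fullmatch("Nov")
full = month.fullmatch("November")

# group(0) is the whole match; group(1) is the optional "ember" part,
# which is None when the group did not participate in the match.
print(short.group(0), short.group(1))
print(full.group(0), full.group(1))

# A partial month name is not accepted: the group is all-or-nothing.
partial = month.fullmatch("Novem")
print(partial)
```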

Using multiple question marks, you can write a regular expression that matches a wide range of options. Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb 23rd and Feb 23.
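The four variants listed above can be checked in one pass. This is only a sketch; the extra candidate "Febr 23" is added here as a counterexample and is not from the original text.

```python
import re

date_pattern = re.compile(r"Feb(ruary)? 23(rd)?")

candidates = ["February 23rd", "February 23", "Feb 23rd", "Feb 23", "Febr 23"]
date_results = {text: date_pattern.fullmatch(text) is not None for text in candidates}

for text, matched in date_results.items():
    print(f"{text!r}: {matched}")
```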

Curly braces can also be used to make something optional: colou{0,1}r is equivalent to colou?r. POSIX BRE and GNU BRE do not support either syntax as written. In these flavours, curly braces need backslashes to take on their special meaning: colou\{0,1\}r
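In Python's re module the two spellings behave the same, as the sketch below suggests. Note that Python does not use the BRE convention: a backslash before a brace makes it a literal character rather than a quantifier. The sample strings are illustrative.

```python
import re

samples = ["color", "colour", "colouur"]

# "?" and "{0,1}" are interchangeable quantifiers in Python.
with_question_mark = [bool(re.fullmatch(r"colou?r", s)) for s in samples]
with_braces = [bool(re.fullmatch(r"colou{0,1}r", s)) for s in samples]
print(with_question_mark)
print(with_braces)

# In Python, escaped braces are literal characters, unlike in POSIX BRE.
literal_braces = bool(re.fullmatch(r"colou\{0,1\}r", "colou{0,1}r"))
print(literal_braces)
```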

Important Regex Concept: Greediness

The question mark is the first greedy metacharacter this tutorial introduces. It gives the regex engine two choices: try to match the part the question mark applies to, or skip it. The engine always tries to match that part first. Only if doing so causes the overall regular expression to fail does the engine try skipping it.

As a result, when the regex Feb 23(rd)? is applied to the text Today is Feb 23rd, 2003, the match is always Feb 23rd and never Feb 23. You can make the question mark lazy (i.e., turn off greediness) by adding a second question mark after the first: Feb 23(rd)??.
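The difference between the greedy and the lazy question mark can be seen with re.search. This sketch assumes the sample sentence contains "Feb 23rd", so that the optional part is actually available to match.

```python
import re

text = "Today is Feb 23rd, 2003"

# Greedy: the engine first tries to match the optional "rd".
greedy = re.search(r"Feb 23(rd)?", text).group()

# Lazy: the second question mark makes the engine try skipping "rd" first.
lazy = re.search(r"Feb 23(rd)??", text).group()

print(greedy)
print(lazy)
```

The greedy pattern returns Feb 23rd, while the lazy one stops at Feb 23 because nothing after the group forces it to continue.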

Syntax Used

re.findall(): The re.findall(pattern, string) method finds all occurrences of
the pattern in the string and returns a list of all matching substrings.

In the example below, the first argument is the regular expression pattern
'aa[cde]?'. The second argument is the string to be searched. Simply put,
you're looking for substrings that begin with two 'a' characters, followed by
one optional character that may be a 'c', 'd', or 'e'.

Example

    # Import the re module
    import re

    # re.findall(pattern, string) finds all occurrences of the pattern in the
    # string and returns a list of all matching substrings.
    result1 = re.findall('aa[cde]?', 'aacde aa aadcde')
    result2 = re.findall('aa?', 'accccacccac')
    result3 = re.findall('[cd]?[cde]?', 'ccc dd ee')

    # Print the results
    print(result1)
    print(result2)
    print(result3)

Output

['aac', 'aa', 'aad']
['a', 'a', 'a']
['cc', 'c', '', 'dd', '', 'e', 'e', '']

Code Explanation

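The first findall() call returns three matching substrings:

First, the pattern matches the string "aac". After Python consumes that match, the remaining text is "de aa aadcde". Second, the string "aa" matches the pattern. Python consumes it, leaving " aadcde" behind. Third, the pattern matches "aad" in what remains. What is left is "cde", which contains no further match.

In the second call, every 'a' is followed by a 'c', so the optional second 'a' is never present and each match is a single "a". In the third call, both tokens are optional, so the pattern can also match the empty string. This is why empty strings appear in the result at the two spaces and at the end of the input.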
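To see where each match of the first call starts and ends, re.finditer can be used in place of re.findall. This is a sketch for inspection only; the tuple layout (start, end, text) is a choice made here.

```python
import re

text = "aacde aa aadcde"

# Each match object knows its start and end offsets in the input string.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"aa[cde]?", text)]

for start, end, matched in spans:
    print(start, end, repr(matched))
```

The gaps between the reported offsets are the characters the engine skipped because no match could start there.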

Looking Inside The Regex Engine

Let's apply the regular expression colou?r to the string The colonel likes the color green.

The first token in the regex is the literal c. The first position where it matches is the c in colonel. The engine continues and finds that o matches o, l matches l, and another o matches o. The engine then checks whether u matches n. It fails. The question mark, however, tells the regex engine that failing to match u is acceptable. As a result, the engine moves on to the next regex token, r. But this also fails to match n. Now the engine can only conclude that the complete regular expression cannot be matched starting at the c in colonel. The engine therefore starts again, trying to match c to the first o in colonel.

After a series of failures, c matches the c in color, and o, l, and o match the following characters. The engine now checks whether u matches r. It fails. Again, no problem. The question mark allows the engine to continue with r. This matches r, and the engine reports that the regex successfully matched color in our string.
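The walk-through can be checked from Python, assuming the sentence uses the spelling color. Pattern.match(text, pos) attempts a match only at the given position, which mirrors a single attempt by the engine; the variable names are illustrative.

```python
import re

text = "The colonel likes the color green."
pattern = re.compile(r"colou?r")

# Position of the first c, which is the c in "colonel".
first_c = text.index("c")

# An attempt anchored at that position fails, as described above.
attempt_at_colonel = pattern.match(text, first_c)

# search() keeps advancing through the string until an attempt succeeds.
found = pattern.search(text)

print(first_c, attempt_at_colonel)
print(found.span(), found.group())
```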

Conclusion

In Python, the quantifier A? matches zero or one occurrence of the regular expression A. The regular expression 'hey?', for instance, matches the strings 'he' and 'hey', but not the empty string ''. This is because the ? quantifier applies only to the regex immediately before it, 'y', not to the whole regex 'hey'.
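A short check of this scope rule is sketched below. The grouped variant (hey)? is added here for contrast and is not part of the original text; it shows that parentheses are needed to make the whole word optional.

```python
import re

inputs = ["he", "hey", "", "h"]

# "?" applies only to the "y" immediately before it.
single_token = {s: re.fullmatch(r"hey?", s) is not None for s in inputs}

# With a group, "?" applies to the whole word, so the empty string matches.
whole_group = {s: re.fullmatch(r"(hey)?", s) is not None for s in inputs}

print(single_token)
print(whole_group)
```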

Updated on: 02-Nov-2023
