How to escape any special character in Python regular expression?


A regular expression, often shortened to regex or regexp, is a powerful tool for finding and manipulating text, especially when processing text files. A single regex can often replace many lines of hand-written code.

Most scripting languages (Perl, Python, PHP, JavaScript), general-purpose programming languages like Java, and even word processors like Word support regex for text searching. Regex can be challenging to learn because of its dense syntax, but the time is well spent.

Special Characters

Text processing becomes more challenging when special characters are involved, because context must be carefully considered. When looking at Python code that uses a regular expression, you must think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This article describes a specific example of handling special characters in various situations.

Escaping TeX Special Characters

We recently needed a regular expression to escape TeX's special characters. For TeX to treat the underscore as a literal underscore and not as a subscript command, we need to turn text like ICD9_CODE into ICD9\_CODE.

The underscore is not TeX's only special character. There are 10 of them:

\ { } $ & # ^ _ % ~

Because they are so common in ordinary prose, $ and % are perhaps the two that people trip over most. Inserting a percent sign without escaping it fails silently because, in TeX, % marks the start of a comment. The result is syntactically valid; the rest of the line is simply dropped.

In a regular expression, the backslash (\) signals one of the following:

  • The character that follows it is special. For instance, \b, \t, and \x20 indicate that a regular expression match should occur at a word boundary, a tab, and a space, respectively.

  • A character that would otherwise be interpreted as a language construct should be taken literally. For instance, a brace ({) begins the definition of a quantifier, but a backslash followed by a brace (\{) tells the regular expression engine to match the brace itself. Similarly, a single backslash (\) marks the start of an escaped language construct, but two backslashes (\\) tell the regular expression engine to match a literal backslash.
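The two roles of the backslash can be checked with a short sketch. The sample strings and variable names below are made up for illustration; only the standard re module is used.

```python
import re

# Role 1: the backslash gives the next character a special meaning.
word_starts = re.findall(r'\bcat', 'cat concat cat')   # \b = word boundary
tab_match = re.search(r'a\tb', 'a\tb')                  # \t = tab
space_match = re.search(r'a\x20b', 'a b')               # \x20 = space (hex 20)

# Role 2: the backslash makes a special character literal.
brace_match = re.search(r'\{', 'f{x}')                  # \{ = a literal brace
backslashes = re.findall(r'\\', r'C:\temp\new')         # \\ = one literal backslash

print(word_starts)        # ['cat', 'cat']
print(len(backslashes))   # 2
```

The "cat" inside "concat" is skipped because it does not start at a word boundary.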

Raw Strings

Something interesting happens here. Most of TeX's special characters are not special to Python. The backslash, though, is special to both, and it is special in regular expressions as well. The r prefix in front of the quotes tells Python that this is a "raw" string in which backslashes should not be treated specially. A raw string literal written with two leading backslashes, such as r'\\x', therefore produces a string that really does start with two backslashes.

Why two backslashes? Why not just one? Because backslashes are also special in regular expressions, which is where we will use this string. More on that shortly.
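As a minimal sketch of why two backslashes are needed, the snippet below builds a raw string holding two backslashes and uses it as a pattern. The variable names and the sample text are assumptions made for the example.

```python
import re

pattern = r'\\'          # raw string: Python keeps both backslashes
print(len(pattern))      # 2

# The regex engine reads the two backslashes as one literal backslash.
text = 'a\\b'            # ordinary string literal holding three characters: a, \, b
found = re.findall(pattern, text)
print(len(found))        # 1
```

Python passes the two characters through untouched, and the regular expression engine then collapses them into a single literal backslash to match.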

Solution

For regular expression patterns, the solution is to use Python's raw string notation: backslashes are not handled in any special way in a string literal prefixed with "r". So r"\n" is a two-character string containing "\" and "n", whereas "\n" is a one-character string containing a newline. Patterns in Python code are usually written in this raw string notation.
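The difference between the two notations is easy to verify. The following sketch compares string lengths and then shows a case where leaving out the r prefix changes what the pattern matches; the variable names are illustrative only.

```python
import re

raw = r"\n"      # two characters: a backslash and the letter n
cooked = "\n"    # one character: a newline
print(len(raw), len(cooked))   # 2 1
same = (raw == "\\n")          # the raw form equals the doubled-backslash form

# In an ordinary string "\b" is a backspace character, so the word-boundary
# meaning only reaches the regex engine when the pattern is a raw string.
with_raw = re.findall(r"\bcat", "cat")
without_raw = re.findall("\bcat", "cat")
print(with_raw, without_raw)   # ['cat'] []
```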

Syntax

line = r"\String"

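Here r"\String" is read as a raw string, so the backslash is kept as a literal character. Without the r prefix, an escape sequence such as "\r" would be turned into a carriage return.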

Example 1

# Import the re module
import re
# re.escape() returns the string with every regex special character backslash-escaped
escaped = re.escape(r'\ a.*$')
# Print the escaped string
print(escaped)

Output

\\\ a\.\*\$

Example 2

# Import the re module
import re
# re.escape() puts a backslash in front of each dot so it matches literally
escaped = re.escape('www.stackoverflow.com')
# Print the escaped string
print(escaped)

Output

www\.stackoverflow\.com
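The practical use of re.escape is to search for a string literally when it may contain regex special characters. Here is a small sketch; contains_literal is a hypothetical helper written for this example, not part of the re module.

```python
import re

def contains_literal(needle, haystack):
    """Return True if needle occurs in haystack, treating needle as plain text."""
    return re.search(re.escape(needle), haystack) is not None

# Unescaped, each dot matches any character, so this lookalike string matches.
unescaped_hit = re.search('www.stackoverflow.com', 'wwwXstackoverflowYcom') is not None

# Escaped, the dots must be real dots.
escaped_miss = contains_literal('www.stackoverflow.com', 'wwwXstackoverflowYcom')
escaped_hit = contains_literal('www.stackoverflow.com', 'see www.stackoverflow.com')

print(unescaped_hit, escaped_miss, escaped_hit)   # True False True
```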

Code Explanation

  • Our lookbehind expression is complicated by the fact that the thing we look back for is itself a special character. We are looking for a backslash, which is a special character in regular expressions.

  • After looking behind to make sure there isn't already a backslash, we search for our special characters. We used two backslashes when defining the variable special so that the regular expression engine sees two backslashes and interprets them as one literal backslash.

  • We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we pass it \\ to represent a literal backslash. Following this with \1 for the first capture gives the replacement string r'\\\1'.
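The bullets above describe a lookbehind pattern combined with re.sub. The following is a sketch of what such code could look like, reconstructed from that description. The exact contents of special, the pattern layout, and the function name escape_tex are assumptions, not the original code.

```python
import re

# Assumed definition: the ten TeX special characters as a raw string.
# The leading two backslashes are read by the regex engine as one literal backslash.
special = r'\\{}$&#^_%~'

# (?<!\\)  negative lookbehind: the match must not be preceded by a backslash
# ([...])  capture one of the special characters
pattern = re.compile(r'(?<!\\)([' + special + r'])')

def escape_tex(text):
    """Put a backslash in front of each special character found (hypothetical helper)."""
    # In the replacement, \\ is a literal backslash and \1 is the captured character.
    return pattern.sub(r'\\\1', text)

print(escape_tex('ICD9_CODE'))    # ICD9\_CODE
print(escape_tex('50% of $10'))   # 50\% of \$10
```

Because the backslash is itself in the set, a backslash already present in the input is escaped too; the lookbehind only protects the character that comes right after a backslash.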

Conclusion

The backslash (\) serves two purposes in regex. First, it introduces metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), and \W (non-word). Second, it escapes special regex characters, such as \. for ., \+ for +, \* for *, and \? for ?. You must also write \\ for \ in regex to avoid ambiguity. Regex also understands \n for newline, \t for tab, and so on. Be aware that in Python the backslash is also used for escape sequences in ordinary strings: "\n" stands for a newline, "\t" for a tab, and you must write "\\" for \. Consequently, to write the regex pattern \\ (which matches one \) in an ordinary string, you must write "\\\\" (two levels of escape). Likewise, the regex metacharacter \d is written "\\d" in an ordinary string, or simply r"\d" as a raw string.
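The points in this summary can be checked with a few lines of code. The sample strings and variable names below are made up for illustration.

```python
import re

# Metacharacters introduced by a backslash
digits = re.findall(r'\d+', 'room 101, floor 7')
words = re.findall(r'\w+', 'a_b c-d')

# Special characters escaped so they match literally
dots = re.findall(r'\.', 'a.b.c')
plus_match = re.search(r'1\+1', '1+1=2')

# Two levels of escape: an ordinary string needs each backslash doubled
same_digit_pattern = ("\\d" == r"\d")
same_backslash_pattern = ("\\\\" == r"\\")
one_backslash = re.findall("\\\\", "a\\b")

print(digits)   # ['101', '7']
print(words)    # ['a_b', 'c', 'd']
```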

Updated on: 04-Apr-2023
