How to match whitespace in Python using regular expressions?


Regular expressions and the ‘re’ module are exceptionally useful tools for text and data manipulation in Python. These make it possible for you to specify patterns of characters that you want to match, replace, or extract from a given string. One of the most often used applications of regular expressions is to find and remove whitespace characters, such as spaces, tabs, and newlines, from a text.

Whitespace characters are often hard to notice, yet they can affect the formatting, readability, and functionality of your code. For instance, to split a string into words, you need to know where the spaces are. Similarly, to align text in columns, you need to know how many tabs are used, and to read data from a file, you need to know how the lines are separated.

In this article, we will show how to match whitespace characters in Python using regular expressions. We will use the ‘re’ module, which provides functions for working with regular expressions, and walk through code examples that perform different tasks involving whitespace.

Finding whitespace characters in a string

The easiest way to match whitespace characters in a string is to use the special sequence \s, which matches any single whitespace character. You can use the re.findall function to return a list of all the matches in a string, or the re.finditer function to return an iterator of match objects.

Example

Suppose we have the following string:

s = "Hello world!\nThis is a\ttest."

We can use the following code to find all the whitespace characters in s:

import re
# A raw string (r"...") keeps the backslash in \s intact for the regex engine
matches = re.findall(r"\s", s)
print(matches)

Output

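[' ', '\n', ' ', ' ', '\t']

As you can see, the list contains a space, a newline, two more spaces, and a tab. For str patterns, \s also matches other Unicode whitespace, such as the non-breaking space, unless the re.ASCII flag is used. The ASCII whitespace characters that \s matches are: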

  • Space

  • Tab (\t)

  • Newline (\n)

  • Carriage return (\r)

  • Form feed (\f)

  • Vertical tab (\v)

If you want to match only some of these characters, you can use a custom character class in square brackets. For example, [\t\n] matches only tabs and newlines.
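A custom class like this can be tried out directly. The following is a minimal sketch that reuses the string s from above; the variable name tabs_and_newlines is only illustrative.

```python
import re

s = "Hello world!\nThis is a\ttest."

# [\t\n] is a custom character class: it matches a tab or a newline,
# but not an ordinary space.
tabs_and_newlines = re.findall(r"[\t\n]", s)
print(tabs_and_newlines)  # ['\n', '\t']
```

The spaces in s are skipped because they are not listed inside the brackets.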

Alternatively, you can use the re.finditer function to get more information about each match, such as its position and span. For example:

Example

import re
# Each item yielded by finditer is a match object, not a plain string
matches = re.finditer(r"\s", s)
for match in matches:
    print(match)

Output

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(12, 13), match='\n'>
<re.Match object; span=(17, 18), match=' '>
<re.Match object; span=(20, 21), match=' '>
<re.Match object; span=(22, 23), match='\t'>
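Match objects also expose their details through methods such as start() and group(). As a small sketch (the variable name positions is an assumption made for illustration), the position and text of every whitespace match can be collected like this:

```python
import re

s = "Hello world!\nThis is a\ttest."

# Collect (start index, matched character) for every whitespace match.
positions = [(m.start(), m.group()) for m in re.finditer(r"\s", s)]
print(positions)
# [(5, ' '), (12, '\n'), (17, ' '), (20, ' '), (22, '\t')]
```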

Replacing whitespace characters in a string

Another common use of regular expressions is to replace whitespace characters in a string with something else. You can use the re.sub function for this. It takes three required arguments: the pattern to match, the replacement string, and the input string. It returns a new string with the matches replaced.

For example, suppose we want to replace all the whitespace characters in s with underscores (_). We can use the following code:

Example

import re
# Replace every single whitespace character with an underscore
new_s = re.sub(r"\s", "_", s)
print(new_s)

Output

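Hello_world!_This_is_a_test.

Note that the period is left untouched: \s matches only whitespace, so punctuation needs no special handling here. A more common need is to replace most whitespace but keep some of it, such as the line breaks. Python's ‘re’ module does not support the && character-class intersection syntax found in Java, so a pattern like [\s&&[^\n]] will not work as intended. The usual idiom is a negated class built on \S (non-whitespace). For example:

Example

import re
new_s = re.sub(r"[^\S\n]", "_", s)
print(new_s)

Output

Hello_world!
This_is_a_test.

The syntax [^\S\n] means "match any character that is neither a non-whitespace character nor a newline", that is, any whitespace character except the newline. The \S means "match any character that is not whitespace". The caret (^) directly after the opening bracket negates the character class.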
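The replacement passed to re.sub can also be a function that receives each match object and returns the replacement text. As a hedged sketch, the helper below (show_whitespace is a hypothetical name, not part of the ‘re’ module) makes otherwise invisible whitespace visible:

```python
import re

s = "Hello world!\nThis is a\ttest."

def show_whitespace(match):
    # repr(' ') is "' '"; stripping the quotes leaves the escaped form,
    # so a newline becomes the two characters backslash and n.
    return "<" + repr(match.group())[1:-1] + ">"

visible = re.sub(r"\s", show_whitespace, s)
print(visible)
# Hello< >world!<\n>This< >is< >a<\t>test.
```

This can help when checking which whitespace characters a string really contains.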

Splitting a string by whitespace characters

Another useful function provided by the ‘re’ module is re.split, which splits a string by a given pattern and returns a list of substrings.

For example, suppose we want to split s by whitespace characters and get a list of words. We can use the following code:

Example

import re
# Split at every single whitespace character
words = re.split(r"\s", s)
print(words)

Output

['Hello', 'world!', 'This', 'is', 'a', 'test.']

Note that the punctuation stays attached to the words ('world!' and 'test.') because \s matches only whitespace. If we want to drop the punctuation as well, we can split on any character that is not alphanumeric or an underscore. For example:

Example

import re
# Split at every character that is not a letter, digit, or underscore
words = re.split(r"[^\w]", s)
print(words)

Output

['Hello', 'world', '', 'This', 'is', 'a', 'test', '']

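The syntax [^\w] means "match any character that is not in \w". The \w means "match any alphanumeric character or underscore". The caret (^) inside the brackets means "negate the character class". The empty strings in the output appear where two separators are adjacent (the "!" followed by the newline) and where the string ends with a separator (the final period).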
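If the empty strings are unwanted, one possible approach is to split on runs of separators and then filter the result. This is a sketch rather than the only way to do it; the names parts and words are illustrative.

```python
import re

s = "Hello world!\nThis is a\ttest."

# \W is shorthand for [^\w]; the + treats a run of separators as one.
parts = re.split(r"\W+", s)
print(parts)   # ['Hello', 'world', 'This', 'is', 'a', 'test', '']

# The trailing '' remains because the string ends with a separator;
# filtering out empty strings removes it.
words = [p for p in parts if p]
print(words)   # ['Hello', 'world', 'This', 'is', 'a', 'test']
```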

Removing leading and trailing whitespace characters from a string

Often you want to remove whitespace at the beginning or the end of a string but not in the middle, for example when cleaning up user input or data read from files. You can use the re.sub function with the anchors ^ and $ for this. The ^ means "match at the beginning of the string" and the $ means "match at the end of the string".

Example

Suppose we have the following string:

s = "   Hello world!   "

We can use the following code to remove the leading and trailing whitespace characters from s:

import re
# Delete whitespace runs anchored at the start or at the end of the string
new_s = re.sub(r"^\s+|\s+$", "", s)
print(new_s)

Output

Hello world!

The pattern ^\s+|\s+$ means "match one or more whitespace characters at the beginning or at the end of the string". The pipe (|) means "or".
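Each half of the pattern can also be used separately to trim only one side. The sketch below shows both variants and, for this particular string, compares the combined pattern with the built-in str.strip() method; the variable names are illustrative.

```python
import re

s = "   Hello world!   "

# Remove whitespace only at the start, or only at the end.
left_trimmed = re.sub(r"^\s+", "", s)
right_trimmed = re.sub(r"\s+$", "", s)
print(repr(left_trimmed))   # 'Hello world!   '
print(repr(right_trimmed))  # '   Hello world!'

# For this string, str.strip() gives the same result as the
# combined pattern ^\s+|\s+$.
both = re.sub(r"^\s+|\s+$", "", s)
print(both == s.strip())    # True
```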

Matching multiple whitespace characters as one

At other times, you may want to treat several consecutive whitespace characters as one and replace them with a single character, for example to normalize the spacing between words in a text or to remove extra spaces from a file. You can use the re.sub function with the quantifier + for this. The + means "match one or more occurrences of the preceding element".

Example

Suppose we have the following string:

s = "Hello    world!\nThis  is  a\t\ttest."

We can use the following code to replace multiple whitespace characters with a single space in s:

import re
# Replace each run of whitespace with a single space
new_s = re.sub(r"\s+", " ", s)
print(new_s)

Output

Hello world! This is a test.

The pattern \s+ means "match one or more whitespace characters". The replacement string is a single space.
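Because \s+ also swallows newlines, the line structure of the text is lost. As a sketch of one alternative, the class [ \t] can be used to collapse only spaces and tabs. The snippet also compares the \s+ result with a split-and-join approach that needs no regular expression; the names per_line and joined are illustrative.

```python
import re

s = "Hello    world!\nThis  is  a\t\ttest."

# Collapse runs of spaces and tabs only, leaving newlines in place.
per_line = re.sub(r"[ \t]+", " ", s)
print(per_line)
# Hello world!
# This is a test.

# Without regular expressions: split on any whitespace, then re-join.
joined = " ".join(s.split())
print(joined == re.sub(r"\s+", " ", s))  # True
```

The two approaches agree here because s has no leading or trailing whitespace; split-and-join would also drop whitespace at the ends.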

In this article, we have learned how to match whitespace characters in Python using regular expressions. We have seen how to use the ‘re’ module and its functions, such as re.findall, re.finditer, re.sub, and re.split, to perform various tasks involving whitespace. We have also learned how to combine character classes, such as \s and \w, with special symbols, such as ^, $, |, and +, to build more complex patterns.

Regular expressions are very useful for manipulating text and data, but they can also be deceptive and confusing. Test your regular expressions carefully and make sure they match what you expect.
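One lightweight way to test a pattern is to run it against a small table of inputs and expected results. The following is only a sketch: normalize_spaces is a hypothetical helper, and the test cases are made up for illustration.

```python
import re

# Hypothetical helper: collapse whitespace runs and trim both ends.
def normalize_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

# Each pair is (input, expected output); the cases are illustrative.
cases = [
    ("  a  b ", "a b"),
    ("a\tb\nc", "a b c"),
    ("", ""),
    ("no-change", "no-change"),
]

for text, expected in cases:
    result = normalize_spaces(text)
    assert result == expected, (text, result, expected)

print("all cases passed")
```

Adding edge cases such as the empty string makes it easier to spot a pattern that matches more or less than intended.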

Updated on: 08-Sep-2023
