What are character classes or character sets used in Python regular expression?
Character classes, also called character sets, are one of the main components of regular expressions. In this chapter, we will look at what they are and how they work, with simple examples and programs that show their usage.
A character class defines a set of characters, any one of which may match at a given position in a string. You can list specific characters, give a range of characters, or combine both.
Character Classes or Character Sets
A character class, also known as a character set, tells the regex engine to match exactly one character out of a set of candidates. Simply place the characters you want to match inside square brackets. If you want to match an 'a' or an 'e', write [ae]. You could use this in gr[ae]y to match either 'gray' or 'grey', which is very helpful when dealing with American and British English spelling variations.
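As a quick illustration of the gr[ae]y pattern, here is a minimal sketch using re.findall. The sample sentence (including the deliberately wrong spelling 'griy') is made up for this example.

```python
import re

# Sample text is an assumption made purely for illustration.
text = "Is it gray or grey? It is never spelled griy."

# [ae] matches exactly one character: either 'a' or 'e'.
spellings = re.findall(r"gr[ae]y", text)
print(spellings)  # ['gray', 'grey']
```

Note that 'griy' is skipped because 'i' is not listed inside the brackets.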
A hyphen can be used within a character class to represent a range of characters. [0-9] matches a single digit from 0 to 9. You can also use multiple ranges: [0-9a-fA-F] matches a single hexadecimal digit in either upper or lower case. Ranges and individual characters can be combined, so [0-9a-fxA-FX] matches a single hexadecimal digit or the letter X.
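The range syntax can be checked one character at a time. The sketch below is only an illustration; the names hex_digit, hex_digit_or_x, and the sample characters are hypothetical choices for this example.

```python
import re

# One hexadecimal digit, in upper or lower case.
hex_digit = re.compile(r"[0-9a-fA-F]")

# Hypothetical sample characters chosen for this sketch.
samples = ["7", "b", "E", "g", "x"]
results = {ch: bool(hex_digit.fullmatch(ch)) for ch in samples}
print(results)
# {'7': True, 'b': True, 'E': True, 'g': False, 'x': False}

# Adding x and X to the class also accepts the letter X in either case.
hex_digit_or_x = re.compile(r"[0-9a-fxA-FX]")
print(bool(hex_digit_or_x.fullmatch("x")))  # True
```

Each class still matches only one character; a quantifier such as + is needed to match a run of them.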
Character classes are one of the most commonly used features of regular expressions. You can find a word even when it is misspelled, using patterns such as sep[ae]r[ae]te or li[cs]en[cs]e. A programming language's identifier can be matched with [A-Za-z_][A-Za-z_0-9]*, and a C-style hexadecimal number with 0[xX][A-Fa-f0-9]+.
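The patterns mentioned above can be tried out as follows. This is a minimal sketch; the helper is_identifier and the sample strings are assumptions introduced for illustration, not part of any library.

```python
import re

identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
c_hex = re.compile(r"0[xX][A-Fa-f0-9]+")
misspelling = re.compile(r"sep[ae]r[ae]te")


def is_identifier(name):
    """Hypothetical helper: True if the whole string is an identifier."""
    return identifier.fullmatch(name) is not None


print(is_identifier("_count1"))  # True
print(is_identifier("1count"))   # False (cannot start with a digit)

# Find C-style hexadecimal literals inside a line of code.
hex_literals = c_hex.findall("mask = 0xFF & 0X1a;")
print(hex_literals)  # ['0xFF', '0X1a']

# Catch the correct spelling and common misspellings alike.
variants = misspelling.findall("separate, seperate, seperete")
print(variants)  # ['separate', 'seperate', 'seperete']
```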
Predefined Character Sets
Here are some of the predefined character sets for your reference −
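- \d matches any digit (equivalent to [0-9] for ASCII text)
- \D matches any non-digit character
- \w matches any word character: a letter, digit, or underscore (equivalent to [a-zA-Z0-9_] for ASCII text)
- \W matches any non-word character
- \s matches any whitespace character (spaces, tabs, and newlines)
- \S matches any non-whitespace character
For str patterns, Python 3 matches these classes against Unicode by default; pass the re.ASCII flag to restrict them to the ASCII ranges shown.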
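To see the predefined sets side by side, the sketch below runs several of them over one string. The sample text is invented for this example. The last two lines illustrate that, for str patterns in Python 3, \d also matches non-ASCII decimal digits unless re.ASCII is passed.

```python
import re

# Invented sample text containing a tab, an underscore, and punctuation.
text = "Room 42,\tfloor_3 !"

digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)
non_word = re.findall(r"\W", text)

print(digits)    # ['4', '2', '3']
print(words)     # ['Room', '42', 'floor_3']
print(spaces)    # [' ', '\t', ' ']
print(non_word)  # [' ', ',', '\t', ' ', '!']

# U+0663 is the Arabic-Indic digit three.
unicode_digits = re.findall(r"\d", "\u0663")
ascii_digits = re.findall(r"\d", "\u0663", flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 1 0
```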
Now we will see some examples to show you how we can use character sets in Python Regular Expressions −
Match Specific Characters
In this example, we create a regex pattern that matches 'cat', 'rat', or 'mat'. The character class [crm] matches exactly one of 'c', 'r', or 'm' immediately before 'at', so 'sat' is not matched −
import re
text = "The cat and rat sat on the mat."
pattern = r'[crm]at'
matches = re.findall(pattern, text)
print(matches)
The output of the above code is −
['cat', 'rat', 'mat']
Matching Vowels
The program below searches for all vowels in the given string and returns a list of those vowels as output −
import re
text = "Hello World! Are you there?"
pattern = r'[aeiouAEIOU]'
matches = re.findall(pattern, text)
print(matches)
The output of the above code is −
['e', 'o', 'o', 'A', 'e', 'o', 'u', 'e', 'e']
Find Numeric Characters
In this program, we will extract all numeric characters or digits from the given string. Here we use the \d pattern, which identifies each digit present in the given string −
import re
text = "There are 4 apples and 10 oranges."
pattern = r'\d'
matches = re.findall(pattern, text)
print(matches)
The output of the above code is −
['4', '1', '0']
Match Username Pattern
The program below checks if the username contains only letters, numbers, or underscores. We use the character set [a-zA-Z0-9_] to define all allowed characters −
import re
username = "user_123"
pattern = r"^[a-zA-Z0-9_]+$"
match = re.match(pattern, username)
if match:
    print("Valid username")
else:
    print("Invalid username")
The output of the above code is −
Valid username
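One detail worth knowing when validating input this way: in Python, $ also matches just before a newline at the very end of the string, so an anchored pattern accepts "user_123\n". The sketch below contrasts that with re.fullmatch, which requires the entire string to match. The two helper functions are hypothetical names introduced for this comparison.

```python
import re

allowed = r"[a-zA-Z0-9_]+"


def is_valid_anchored(username):
    """Hypothetical helper using ^...$ with re.match, as in the example."""
    return re.match(r"^" + allowed + r"$", username) is not None


def is_valid_fullmatch(username):
    """Hypothetical helper using re.fullmatch on the bare character set."""
    return re.fullmatch(allowed, username) is not None


print(is_valid_anchored("user_123\n"))   # True: $ matches before the final newline
print(is_valid_fullmatch("user_123\n"))  # False: the newline is not in the set
print(is_valid_fullmatch("user_123"))    # True
print(is_valid_fullmatch("user 123"))    # False: a space is not in the set
```

If a trailing newline should be rejected, re.fullmatch is a reasonable choice.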
Negated Character Classes
You can also create a negated character class by placing a caret ^ immediately after the opening bracket. The class then matches any single character that is NOT in the specified set −
import re
text = "abc123XYZ!@#"
pattern = r'[^a-zA-Z]'  # Match non-alphabetic characters
matches = re.findall(pattern, text)
print(matches)
The output of the above code is −
['1', '2', '3', '!', '@', '#']
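Negated classes are also handy for stripping unwanted characters with re.sub. The sketch below is an illustration only; the phone number is a made-up value. It also shows that the caret negates the class only when it comes first inside the brackets.

```python
import re

# Made-up phone number used purely as sample input.
phone = "+1 (555) 010-9999"

# Replace every character that is not a digit with an empty string.
digits_only = re.sub(r"[^0-9]", "", phone)
print(digits_only)  # 15550109999

# A caret that is not the first character in the class is a literal '^'.
literal_caret = re.findall(r"[a^]", "a^b")
print(literal_caret)  # ['a', '^']
```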
Conclusion
Character classes in Python regular expressions provide a powerful way to match specific sets of characters. Use square brackets [...] to define custom character sets, leverage predefined classes like \d and \w for common patterns, and use negated classes [^...] to exclude specific characters.
