
5 Simple Ways to Perform Tokenization in Python
Tokenization is the process of splitting a string into smaller pieces (tokens). In the context of natural language processing (NLP), tokens are words, punctuation marks, and numbers. Tokenization is a preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.
In this article, we will look at the following ways to perform tokenization in Python:
- Using the split() Method
- Using the NLTK Library
- Using Regular Expressions
- Using the shlex Module
Using the split() Method
The split() method is a built-in function of Python's str class that allows you to split a string into a list of substrings based on a specified delimiter. One limitation of the split() method is that it can only split on a single, fixed delimiter at a time; it cannot handle multiple delimiters or separate punctuation from words.
Example
Here's an example of how to use it:
text = "This is a sample text" tokens = text.split(" ") print(tokens)
This code will split the string text on the space character, and the resulting tokens will be:
['This', 'is', 'a', 'sample', 'text']
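To see the limitation mentioned above, consider a string that contains punctuation. The sentence below is our own illustration; splitting on a space keeps the punctuation attached to the neighbouring words:
text = "Hello, world! This is a sample text."
tokens = text.split(" ")
print(tokens)
# ['Hello,', 'world!', 'This', 'is', 'a', 'sample', 'text.']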
Using the NLTK Library
The Natural Language Toolkit (NLTK) is a Python library used for working with human language data. You can use the following built-in functions and classes of the NLTK library to split strings into tokens:
- word_tokenize() function
- sent_tokenize() function
- TweetTokenizer class
To use the NLTK library, you will need to install it first. You can do this by running the following command:
pip install nltk
Let us go through some examples.
Example: Using word_tokenize() Function
The word_tokenize() function accepts a string and splits it into word and punctuation tokens. In the following Python program, we use this function to split a simple string into its words:
import nltk

# Download the tokenizer data if it is not already present
nltk.download('punkt')

text = "This is a sample text"
tokens = nltk.word_tokenize(text)
print(tokens)
This will produce the same result as the split() method above:
['This', 'is', 'a', 'sample', 'text']
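Unlike split(), word_tokenize() also separates punctuation into its own tokens. A small sketch, using a sentence of our own for illustration:
import nltk

text = "Hello, world! Don't panic."
print(nltk.word_tokenize(text))
# With the standard NLTK models this prints:
# ['Hello', ',', 'world', '!', 'Do', "n't", 'panic', '.']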
Example: Using sent_tokenize() Function
The sent_tokenize() function accepts a longer piece of text and returns a list of its sentences. Let's see an example:
from nltk.tokenize import sent_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization. And this is another sentence"

# Tokenize the text into sentences
sentences = sent_tokenize(text)
print(sentences)
This will show a list of sentences:
['This is an example sentence for tokenization.', 'And this is another sentence']
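Because sent_tokenize() is backed by a trained model (punkt) rather than a simple split on the period character, it usually does not break sentences at common abbreviations. A small sketch with a made-up sentence:
from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 5 p.m. on Monday. The meeting started shortly after."
print(sent_tokenize(text))
# Typically: ['Dr. Smith arrived at 5 p.m. on Monday.', 'The meeting started shortly after.']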
Example: Using TweetTokenizer Class
The NLTK library also provides a class named TweetTokenizer, which is specifically designed for tokenizing tweets (short text messages on the social media platform Twitter).
It handles tweet-specific features such as hashtags, mentions, and emoticons. Here is an example of how to use the TweetTokenizer:
import nltk

# Download the NLTK tokenizer data
nltk.download('punkt')

from nltk.tokenize import TweetTokenizer

# Define the text to be tokenized
tweet = "This is an example tweet with #hashtag and @mention. ?"

# Create a TweetTokenizer object
tokenizer = TweetTokenizer()

# Tokenize the text
tokens = tokenizer.tokenize(tweet)
print(tokens)
It will produce the following output:
['This', 'is', 'an', 'example', 'tweet', 'with', '#hashtag', 'and', '@mention', '?']
As you can see, the TweetTokenizer not only tokenizes the text into individual words, but also keeps hashtags and mentions as separate tokens. In addition, it can handle emojis, emoticons, and other special characters that are commonly used in tweets.
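TweetTokenizer also accepts a few useful options: strip_handles removes @mentions entirely, and reduce_len shortens exaggerated character repetitions. A small sketch with a made-up tweet:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tweet = "@mention this is sooooooo cool #hashtag"
print(tokenizer.tokenize(tweet))
# ['this', 'is', 'sooo', 'cool', '#hashtag']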
Using Regular Expressions
A regular expression is a sequence of characters used for matching and manipulating strings. To use regular expressions in Python, we need to import the re module.
The re module provides a function called split(), which accepts a pattern and a string as parameters. It divides the string into substrings (tokens) wherever the pattern matches.
Example
Let's see an example of using regular expressions to perform tokenization in Python:
import re

text = "This is a sample text"

# Split on one or more whitespace characters
pattern = r"\s+"
tokens = re.split(pattern, text)
print(tokens)

# Split on runs of non-whitespace characters (this removes the words themselves)
pattern = r"\S+"
tokens = re.split(pattern, text)
print(tokens)

# Split on runs of digits (the text contains none, so it is left unchanged)
pattern = r"\d+"
tokens = re.split(pattern, text)
print(tokens)
When you run this code, it will produce the following output:
['This', 'is', 'a', 'sample', 'text']
['', ' ', ' ', ' ', ' ', '']
['This is a sample text']
In this code, we have three sections:
- The first section splits on one or more whitespace characters, so the resulting tokens are the words in the string.
- The second section splits on runs of non-whitespace characters; since the words themselves act as the separators, what remains is a list of empty strings and the spaces between the words.
- The third section splits on runs of digits; the string contains no digits, so nothing is split and the whole string is returned as a single token.
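If you want the matched pieces themselves rather than the text between them, re.findall() is often more convenient than re.split(). A short sketch of this pattern-based approach, using a sentence of our own for illustration:
import re

text = "Tokenize this: words, numbers like 42, and punctuation!"

# Extract runs of word characters as tokens
tokens = re.findall(r"\w+", text)
print(tokens)
# ['Tokenize', 'this', 'words', 'numbers', 'like', '42', 'and', 'punctuation']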
Using the shlex Module
The shlex module provides a built-in function named split() that splits a string into tokens using shell-like rules, so substrings enclosed in quotes are kept together as single tokens. To use the shlex module, you will need to import it first.
Example
The following program shows how to use shlex.split() function for tokenization:
import shlex

text = "This is a sample text"
tokens = shlex.split(text)
print(tokens)
When you run this code, it will produce the following output:
['This', 'is', 'a', 'sample', 'text']
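Where shlex.split() differs from str.split() is in how it treats quotes. A short sketch with a made-up command-like string:
import shlex

text = 'convert "input file.txt" --mode "fast copy"'
print(shlex.split(text))
# ['convert', 'input file.txt', '--mode', 'fast copy']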
Conclusion
In this article, we looked at several ways to perform tokenization in Python: using the split() method, the NLTK library, regular expressions, and the shlex module.