5 Simple Ways to Perform Tokenization in Python
Tokenization is the process of splitting a string into tokens, or "smaller pieces". In the context of natural language processing (NLP), tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.
In this article, we'll look at five ways to perform tokenization in Python. We'll start with the simplest approach, using the split() method, and then move on to more advanced techniques using libraries and modules such as nltk, re, string, and shlex.
Using the split() Method
The split() method is built into Python's str class and lets you split a string into a list of substrings based on a specified delimiter. Here's an example of how to use it −
text = "This is a sample text" tokens = text.split(" ") print(tokens)
This code will split the string text on the space character, and the resulting tokens will be
['This', 'is', 'a', 'sample', 'text'].
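Note that calling split() with no arguments splits on any run of whitespace and discards empty strings, which is often what you want for quick word-level tokenization −

text = "This  is   a sample text"
# With no separator given, consecutive whitespace is treated as one delimiter
tokens = text.split()
print(tokens)   # ['This', 'is', 'a', 'sample', 'text']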
Note that the split() method accepts only a single separator string; passing a list of delimiters raises a TypeError. If you want to split on several delimiters at once, such as spaces, commas, and exclamation points, you can use re.split() from Python's re module, which is covered in more detail below −

import re

text = "This is a sample, text with punctuation!"
# Split on runs of spaces, commas, and exclamation points
tokens = re.split(r"[ ,!]+", text)
print(tokens)

This will split the string text on spaces, commas, and exclamation points, resulting in the tokens ['This', 'is', 'a', 'sample', 'text', 'with', 'punctuation', '']. Notice the trailing empty string, which appears because the text ends with a delimiter.
One limitation of the split() method is that it can only split a string on a single, fixed delimiter. If you want to split a string on more complex patterns, such as words or numbers, you'll need to use a more advanced technique, as shown below.
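For example, splitting only on spaces leaves punctuation attached to the neighbouring words −

text = "Hello, world!"
tokens = text.split(" ")
print(tokens)   # ['Hello,', 'world!']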
Using the nltk library
The Natural Language Toolkit (nltk) is a popular Python library for working with human language data. It provides several tokenization functions that can be used to split strings into tokens based on various criteria.
To use the nltk library, you'll need to install it first. You can do this by running the following command −
pip install nltk
Once you have nltk installed, you can use the word_tokenize() function to split a string into tokens based on word boundaries −
import nltk text = "This is a sample text" tokens = nltk.word_tokenize(text) print(tokens)
This will produce the same result as the split() method above.
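Unlike split(), however, word_tokenize() also separates punctuation into tokens of its own, which matters for real text. A quick sketch, assuming the punkt models from above are already downloaded −

import nltk

text = "Hello, world! Isn't NLP fun?"
tokens = nltk.word_tokenize(text)
print(tokens)
# Typically: ['Hello', ',', 'world', '!', 'Is', "n't", 'NLP', 'fun', '?']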
The nltk library also provides a number of other tokenization functions, such as sent_tokenize(), which tokenizes a text into sentences.
Example
Let's see an example −
from nltk.tokenize import sent_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization. And this is another sentence"

# Tokenize the text into sentences
sentences = sent_tokenize(text)
print(sentences)
Output
This will output a list of sentences −
['This is an example sentence for tokenization.', 'And this is another sentence']
Example
We can also tokenize the text using the word_tokenize() method from the nltk.tokenize module as follows −
from nltk.tokenize import word_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization."

# Tokenize the text into words
words = word_tokenize(text)
print(words)
Output
This will also output a list of words −
['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']
As you can see, word_tokenize() splits the text into individual words and treats the trailing period as a separate token. It is the same function as nltk.word_tokenize() used earlier, just imported directly from the nltk.tokenize module.
Example
The NLTK library also provides a class named TweetTokenizer, which is specifically designed for tokenizing tweets (short messages posted on the social media platform Twitter). It works similarly to word_tokenize(), but it takes into account features specific to tweets, such as hashtags, mentions, and emoticons.
Here is an example of how to use the TweetTokenizer −
from nltk.tokenize import TweetTokenizer

# Define the text to be tokenized
tweet = "This is an example tweet with #hashtag and @mention. 😊"

# Create a TweetTokenizer object
tokenizer = TweetTokenizer()

# Tokenize the text
tokens = tokenizer.tokenize(tweet)
print(tokens)
Output
It will produce the following output −
['This', 'is', 'an', 'example', 'tweet', 'with', '#hashtag', 'and', '@mention', '.', '😊']
As you can see, the TweetTokenizer not only tokenizes the text into individual words, but also preserves hashtags and mentions as separate tokens. Also, it can handle emojis, emoticons and other special characters that are commonly used in tweets.
This can be useful if you are working with Twitter data and want to analyze specific aspects of tweets, such as hashtags and mentions.
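As a minimal sketch (the tweet text and variable names here are made up for illustration), you could pull the hashtags and mentions out of the token list like this −

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tweet = "Loving the new release! #python @py_dev 😊"
tokens = tokenizer.tokenize(tweet)

# Collect hashtag and mention tokens separately
hashtags = [t for t in tokens if t.startswith("#")]
mentions = [t for t in tokens if t.startswith("@")]
print(hashtags)   # ['#python']
print(mentions)   # ['@py_dev']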
Using regular expressions
Regular expressions are a powerful tool for matching and manipulating strings, and they can be used to perform a wide variety of tokenization tasks.
Example
Let's see an example of using regular expressions to perform tokenization in Python −
import re text = "This is a sample text" # Split on one or more whitespace characters pattern = r"\s+" tokens = re.split(pattern, text) print(tokens) # Split on words (any sequence of characters that are not whitespace) pattern = r"\S+" tokens = re.split(pattern, text) print(tokens) # Split on numbers (any sequence of digits) pattern = r"\d+" tokens = re.split(pattern, text) print(tokens)
In this code, we have three sections −
The first section uses a regular expression pattern that matches one or more whitespace characters, and the resulting tokens are the words in the string.
The second section uses a regular expression pattern that matches any sequence of non-whitespace characters. Because re.split() splits on whatever the pattern matches, the words themselves are removed, leaving only empty strings and the whitespace that separated them.
The third section uses a regular expression pattern that matches any sequence of digits. Since the sample text contains no digits, nothing is split and the whole string is returned as a single token.
Output
When you run this code, it will produce the following output −
['This', 'is', 'a', 'sample', 'text']
['', ' ', ' ', ' ', ' ', '']
['This is a sample text']
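In practice, it is often more convenient to extract tokens directly with re.findall() instead of splitting on a separator. A minimal sketch −

import re

text = "This is a sample, text with punctuation!"

# Find every run of word characters (letters, digits, underscores); punctuation is dropped
tokens = re.findall(r"\w+", text)
print(tokens)
# ['This', 'is', 'a', 'sample', 'text', 'with', 'punctuation']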
Using the string module
The string module in Python provides a number of string-processing utilities, including a Template class that can be used, together with split(), to tokenize strings containing placeholders.
To use the Template class, you'll need to import the string module and define a template string with placeholders for the values you want to insert. For example −
import string text = "This is a $token text" template = string.Template(text)
You can then use the substitute() method to replace the placeholders with actual values and split the resulting string on the space character −
tokens = template.substitute({"token": "sample"}).split(" ") print(tokens)
This will replace the placeholder $token with the word "sample" and split the resulting string on the space character, resulting in the tokens ['This', 'is', 'a', 'sample', 'text'].
The Template class is useful for tokenizing strings with variable values, such as template emails or messages.
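As a minimal sketch (the message text and field names below are made up for illustration), the same pattern works for a templated message with several placeholders −

import string

# A hypothetical templated message with two placeholders
message = string.Template("Dear $name, your order $order_id has shipped.")

# Fill in the placeholders, then tokenize the finished message
filled = message.substitute({"name": "Alice", "order_id": "12345"})
tokens = filled.split()
print(tokens)
# ['Dear', 'Alice,', 'your', 'order', '12345', 'has', 'shipped.']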
Using the shlex module
The shlex module provides a lexical analyzer for shell-style syntax. It can be used to split a string into tokens in a way that mimics how the shell parses a command line.
To use the shlex module, you'll need to import it first −
import shlex text = "This is a sample text" tokens = shlex.split(text) print(tokens)
This will split the string on space characters, just like the split() method and the nltk library. The shlex module is useful for tokenizing strings with shell-style syntax, such as command-line arguments.
Output
When you run this code, it will produce the following output −
['This', 'is', 'a', 'sample', 'text']
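The real advantage of shlex shows up when the string contains quoted substrings, which are kept together as single tokens (the command string below is just a made-up example) −

import shlex

command = 'grep "error message" /var/log/syslog -i'
tokens = shlex.split(command)
print(tokens)
# ['grep', 'error message', '/var/log/syslog', '-i']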
Conclusion
Tokenization is the process of splitting a string into smaller pieces, or tokens. In the context of natural language processing, tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.
In this tutorial, we looked at five ways to perform tokenization in Python: using the split() method, the nltk library, regular expressions, the string module, and the shlex module. Each of these methods has its own advantages and limitations, so it's important to choose the one that best fits your needs. Whether you're working with simple strings or complex human language data, Python provides a range of tools and libraries that you can use to tokenize your text effectively.