5 Simple Ways to Perform Tokenization in Python


Tokenization is the process of splitting a string into tokens, or "smaller pieces". In the context of natural language processing (NLP), tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.

In this article, we'll look at five ways to perform tokenization in Python. We'll start with the simplest method, using the split() method, and then move on to more advanced techniques using libraries and modules such as nltk, re, string, and shlex.

Using the split() Method

The split() method is a built-in method of Python's str class that allows you to split a string into a list of substrings based on a specified delimiter. Here's an example of how to use it −

text = "This is a sample text"
tokens = text.split(" ")
print(tokens)

This code will split the string text on the space character, and the resulting tokens will be

['This', 'is', 'a', 'sample', 'text'].

Note that split() accepts only a single separator string − you cannot pass it a list of delimiters, and calling text.split([" ", ",", "!"]) raises a TypeError. If you need to split on several delimiters at once, you can use the re.split() function from the re module instead. For example −

import re

text = "This is a sample, text with punctuation!"
tokens = re.split(r"[ ,!]", text)
print(tokens)

This will split the string text on spaces, commas, and exclamation points, resulting in the tokens ['This', 'is', 'a', 'sample', '', 'text', 'with', 'punctuation', '']. Notice that empty strings appear wherever two delimiters occur next to each other, such as the comma followed by a space.

Another limitation of the split() method is that it can only split on fixed delimiters. If you want to split a string on more complex patterns, such as words or numbers, you'll need to use a more advanced technique.

Using the nltk library

The Natural Language Toolkit (nltk) is a popular Python library for working with human language data. It provides several tokenization functions that can be used to split strings into tokens based on various criteria.

To use the nltk library, you'll need to install it first. You can do this by running the following command −

pip install nltk

Once you have nltk installed, you can use the word_tokenize() function to split a string into tokens based on word boundaries −

import nltk

# Download the punkt tokenizer data (only needed the first time)
nltk.download('punkt')

text = "This is a sample text"
tokens = nltk.word_tokenize(text)
print(tokens)

This will produce the same result as the split() method above for this simple sentence.
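The difference becomes visible on text that contains punctuation. Here's a small illustrative comparison (assuming nltk and its punkt data are set up as above) −

text = "Hello, world!"

# split() keeps punctuation attached to the neighbouring words
print(text.split(" "))            # ['Hello,', 'world!']

# word_tokenize() separates punctuation into its own tokens
print(nltk.word_tokenize(text))   # ['Hello', ',', 'world', '!']

Separating punctuation into its own tokens is usually what you want for downstream NLP tasks such as counting word frequencies.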

The nltk library also provides a number of other tokenization functions, such as sent_tokenize(), which tokenizes a text into sentences.

Example

Let's see an example −

from nltk.tokenize import sent_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization. And this is another sentence"

# Tokenize the text into sentences
sentences = sent_tokenize(text)

print(sentences)

Output

This will output a list of sentences −

['This is an example sentence for tokenization.', 'And this is another sentence']

Example

We can also tokenize the text using the word_tokenize() method from the nltk.tokenize module as follows −

from nltk.tokenize import word_tokenize
# Define the text to be tokenized
text = "This is an example sentence for tokenization."
# Tokenize the text into words
words = word_tokenize(text)
print(words)

Output

This will also output a list of words −

['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']

As you can see, the word_tokenize() method splits the text into individual words and punctuation marks. It is the same function we called earlier as nltk.word_tokenize(); importing it from nltk.tokenize simply lets you drop the module prefix.

Example

NLTK library also provides a class named TweetTokenizer, which is specifically designed for tokenizing tweets (short text messages on the social media platform Twitter). It works similarly to the word_tokenize() method, but it takes into account the specific features of tweets, such as hashtags, mentions, and emoticons.

Here is an example of how to use the TweetTokenizer −

from nltk.tokenize import TweetTokenizer

# Define the text to be tokenized
tweet = "This is an example tweet with #hashtag and @mention. 😊"

# Create a TweetTokenizer object
tokenizer = TweetTokenizer()

# Tokenize the text
tokens = tokenizer.tokenize(tweet)
print(tokens)

Output

It will produce the following output −

['This', 'is', 'an', 'example', 'tweet', 'with', '#hashtag', 'and', '@mention', '.', '😊']

As you can see, the TweetTokenizer not only splits the text into individual words, but also preserves hashtags and mentions as single tokens instead of breaking them apart. It can also handle emojis, emoticons, and other special characters that are commonly used in tweets.

This is useful if you are working with Twitter data and want to analyze specific aspects of tweets, such as hashtags and mentions.
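For example, once the tweet has been tokenized, picking out the hashtags and mentions is a simple matter of filtering the token list. Here is a minimal sketch using the same tweet as above −

from nltk.tokenize import TweetTokenizer

tweet = "This is an example tweet with #hashtag and @mention. 😊"
tokens = TweetTokenizer().tokenize(tweet)

# Keep only the tokens that look like hashtags or mentions
hashtags = [token for token in tokens if token.startswith("#")]
mentions = [token for token in tokens if token.startswith("@")]

print(hashtags)   # ['#hashtag']
print(mentions)   # ['@mention']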

Using regular expressions

Regular expressions are a powerful tool for matching and manipulating strings, and they can be used to perform a wide variety of tokenization tasks.

Example

Let's see an example of using regular expressions to perform tokenization in Python −

import re

text = "This is a sample text"

# Split on one or more whitespace characters
pattern = r"\s+"
tokens = re.split(pattern, text)
print(tokens)

# Split on words (any sequence of characters that are not whitespace)
pattern = r"\S+"
tokens = re.split(pattern, text)
print(tokens)

# Split on numbers (any sequence of digits)
pattern = r"\d+"
tokens = re.split(pattern, text)
print(tokens)

In this code, we call re.split() with three different patterns −

  • The first pattern, \s+, matches one or more whitespace characters, so the string is split between words and the resulting tokens are the words themselves.

  • The second pattern, \S+, matches any run of non-whitespace characters. Splitting on the words themselves leaves only what sits between them, so the result is a list of empty strings and whitespace separators.

  • The third pattern, \d+, matches any sequence of digits. Since the sample text contains no digits, nothing is split and the whole string comes back as a single token.

Output

When you run this code, it will produce the following output −

['This', 'is', 'a', 'sample', 'text']
['', ' ', ' ', ' ', ' ', '']
['This is a sample text']
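If you want regular expressions to extract tokens directly, rather than describing what to split on, re.findall() is often more convenient. Here's a minimal sketch that pulls out words and individual punctuation marks (the exact pattern shown is just one common choice) −

import re

text = "This is a sample, text with punctuation!"

# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
pattern = r"\w+|[^\w\s]"
tokens = re.findall(pattern, text)
print(tokens)

This produces ['This', 'is', 'a', 'sample', ',', 'text', 'with', 'punctuation', '!'], which is much closer to the behaviour of nltk's word_tokenize().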

Using the string module

The string module in Python provides a number of string processing helpers, including a Template class. Template itself performs placeholder substitution rather than tokenization, but combined with split() it can be used to tokenize strings that contain variable values.

To use the Template class, you'll need to import the string module and define a template string with placeholders for the values you want to substitute. For example −

import string
text = "This is a $token text"
template = string.Template(text)

You can then use the substitute() method to replace the placeholders with actual values and split the resulting string on the space character −

tokens = template.substitute({"token": "sample"}).split(" ")
print(tokens)

This will replace the placeholder $token with the word "sample" and split the resulting string on the space character, resulting in the tokens ['This', 'is', 'a', 'sample', 'text'].

The Template class is useful for tokenizing strings with variable values, such as template emails or messages.
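As a small illustrative sketch (the placeholder names and values here are made up), the same pattern works for a templated message with several placeholders −

import string

# A message template with two placeholders (names chosen for illustration)
template = string.Template("Hello $name your order $order_id has shipped")

# Fill in the placeholders, then tokenize the finished message
message = template.substitute({"name": "Alice", "order_id": "42"})
tokens = message.split(" ")
print(tokens)

This prints ['Hello', 'Alice', 'your', 'order', '42', 'has', 'shipped'].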

Using the shlex module

The shlex module provides a lexical analyzer for shell-style syntax. It can be used to split a string into tokens in a way that mimics how a Unix shell parses a command line.

To use the shlex module, you'll need to import it first −

import shlex
text = "This is a sample text"
tokens = shlex.split(text)
print(tokens)

This will split the string on space characters, just like the split() method and the nltk library. The shlex module is useful for tokenizing strings with shell-style syntax, such as command-line arguments.

Output

When you run this code, it will produce the following output −

['This', 'is', 'a', 'sample', 'text']
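Where shlex really differs from a plain split() is in how it handles quoting. As a small sketch, consider a command line with a quoted argument that contains a space −

import shlex

# The quoted filename contains a space, but shlex keeps it as a single token
command = 'cp "my file.txt" /tmp'
tokens = shlex.split(command)
print(tokens)

This prints ['cp', 'my file.txt', '/tmp'] − the quotes are removed and the quoted argument stays intact, which a simple split(" ") would not do.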

Conclusion

Tokenization is the process of splitting a string into smaller pieces, or tokens. In the context of natural language processing, tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.

In this tutorial, we looked at five ways to perform tokenization in Python: using the split() method, the nltk library, regular expressions, the string module, and the shlex module. Each of these methods has its own advantages and limitations, so it's important to choose the one that best fits your needs. Whether you're working with simple strings or complex human language data, Python provides a range of tools and libraries that you can use to tokenize your text effectively.
