5 Simple Ways to Perform Tokenization in Python



Tokenization is the process of splitting a string into smaller pieces (tokens). In the context of natural language processing (NLP), tokens are words, punctuation marks, and numbers. Tokenization is a preprocessing step for many NLP tasks, as it allows you to work with individual words and symbols rather than raw text.

In this article, we will look at the following ways to perform tokenization in Python:

  • Using the split() Method
  • Using the NLTK Library
  • Using Regular Expressions
  • Using the shlex Module

Using the split() Method

The split() method is a built-in method of Python's str class that splits a string into a list of substrings based on a specified delimiter. One limitation of the split() method is that it can only split on a single, fixed delimiter at a time, so it cannot separate punctuation marks from the words they are attached to.

Example

Here's an example of how to use it:

text = "This is a sample text"
tokens = text.split(" ")
print(tokens)

This code will split the string text on the space character, and the resulting tokens will be:

['This', 'is', 'a', 'sample', 'text']
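
To see the limitation mentioned above, consider a sentence that contains punctuation. The following extra snippet is only an illustration (the sample sentence is made up), but it shows that split() leaves punctuation attached to the neighboring words:

text = "Hello, world! This is a sample text."
tokens = text.split(" ")
print(tokens)

Since split() only looks at the space character, the output will be:

['Hello,', 'world!', 'This', 'is', 'a', 'sample', 'text.']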

Using the NLTK Library

The Natural Language Toolkit (NLTK) is a Python library used for working with human language data. You can use the following built-in functions and classes of the NLTK library to split strings into tokens:

  • word_tokenize() function
  • sent_tokenize() function
  • TweetTokenizer class

To use the NLTK library, you will need to install it first. You can do this by running the following command:

pip install nltk

Let us go through some examples.

Example: Using word_tokenize() Function

The word_tokenize() function accepts a piece of text and splits it into individual words, treating punctuation marks as separate tokens. Before using it for the first time, you need to download the punkt tokenizer data. In the following Python program, we use this function to split a string into tokens:

import nltk
# Download the tokenizer models (only needed once);
# newer NLTK versions may ask for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
text = "This is a sample text"
tokens = nltk.word_tokenize(text)
print(tokens)

This will produce the same result as the split() method above:

['This', 'is', 'a', 'sample', 'text']
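
The difference from the split() method shows up as soon as the text contains punctuation. The following extra snippet (an illustration, not part of the original example) demonstrates that word_tokenize() returns punctuation marks as separate tokens:

import nltk
text = "Hello, world! This is a sample text."
print(nltk.word_tokenize(text))

This should print:

['Hello', ',', 'world', '!', 'This', 'is', 'a', 'sample', 'text', '.']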

Example: Using sent_tokenize() Function

The sent_tokenize() function accepts a longer piece of text and returns a list of its sentences. Let's see an example:

from nltk.tokenize import sent_tokenize
# Define the text to be tokenized
text = "This is an example sentence for tokenization. And this is another sentence"
# Tokenize the text into sentences
sentences = sent_tokenize(text)
print(sentences)

This will show a list of sentences:

['This is an example sentence for tokenization.', 'And this is another sentence']
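
Because sent_tokenize() relies on the pre-trained punkt model, it is usually smarter than simply splitting on periods. As a rough illustration (this snippet and its sample text are not part of the original example), it should typically not break a sentence at a common abbreviation such as "Dr.":

from nltk.tokenize import sent_tokenize
text = "I met Dr. Smith yesterday. He was very helpful."
print(sent_tokenize(text))

This should typically print:

['I met Dr. Smith yesterday.', 'He was very helpful.']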

Example: Using TweetTokenizer Class

The NLTK library also provides a class named TweetTokenizer, which is specifically designed for tokenizing tweets (short text messages posted on the social media platform Twitter).

It works with the specific features of tweets, such as hashtags, mentions, and emoticons. Here is an example of how to use the TweetTokenizer:

import nltk 
# Download the NLTK tokenizer 
nltk.download('punkt')
from nltk.tokenize import TweetTokenizer
# Define the text to be tokenized
tweet = "This is an example tweet with #hashtag and @mention. ?"
# Create a TweetTokenizer object
tokenizer = TweetTokenizer()
# Tokenize the text
tokens = tokenizer.tokenize(tweet)
print(tokens)

It will produce the following output:

['This', 'is', 'an', 'example', 'tweet', 'with', '#hashtag', 'and', '@mention', '.', '?']

As you can see, the TweetTokenizer not only tokenizes the text into individual words, but also keeps hashtags and mentions as separate tokens. In addition, it can handle emojis, emoticons, and other special characters that are commonly used in tweets.
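
TweetTokenizer also accepts optional parameters such as strip_handles (remove @mentions from the output) and reduce_len (shorten characters repeated more than three times). The following snippet is only a sketch of how these options can be combined, using a made-up tweet:

from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions, reduce_len shortens "soooooo" to "sooo"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tweet = "@someuser This is soooooo coooool #nlp"
print(tokenizer.tokenize(tweet))

This should print something like:

['This', 'is', 'sooo', 'coool', '#nlp']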

Using Regular Expressions

A regular expression is a sequence of characters used for matching and manipulating strings. To use regular expressions in Python, we need to import the re module.

The Python re module provides a function called split(), which accepts a string and a pattern as parameters. It divides the string into substrings or tokens based on the specified pattern.

Example

Let's see an example of using regular expressions to perform tokenization in Python:

import re

text = "This is a sample text"
# Split on one or more whitespace characters
pattern = r"\s+"
tokens = re.split(pattern, text)
print(tokens)

# Split on any run of non-whitespace characters (i.e., on the words themselves)
pattern = r"\S+"
tokens = re.split(pattern, text)
print(tokens)

# Split on any sequence of digits (this text contains none)
pattern = r"\d+"
tokens = re.split(pattern, text)
print(tokens)

When you run this code, it will produce the following output:

['This', 'is', 'a', 'sample', 'text']
['', ' ', ' ', ' ', ' ', '']
['This is a sample text']

In this code, we have three sections:

  • The first section uses a regular expression pattern that matches one or more whitespace characters, and the resulting tokens are the words in the string.

  • The second section uses a regular expression pattern that matches any sequence of non-whitespace characters, so the words themselves act as delimiters; the result is a list of the whitespace between the words, plus empty strings at the start and end.

  • The third section uses a regular expression pattern that matches any sequence of digits; since the string contains no digits, nothing is split and the whole string is returned as a single element.
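
If you want to extract the word tokens directly rather than split on delimiters, re.findall() with a pattern that matches the tokens themselves is often more convenient. The following extra snippet is just a sketch of that approach:

import re

text = "Hello, world! This is a sample text."
# Find every run of word characters (letters, digits, underscore)
tokens = re.findall(r"\w+", text)
print(tokens)

This should print:

['Hello', 'world', 'This', 'is', 'a', 'sample', 'text']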

Using the shlex Module

The shlex module provides a built-in function named split() that splits a string into tokens using shell-like rules, which means that substrings enclosed in quotes are kept together as single tokens. To use the shlex module, you will need to import it first.

Example

The following program shows how to use shlex.split() function for tokenization:

import shlex

text = "This is a sample text"
tokens = shlex.split(text)
print(tokens)

When you run this code, it will produce the following output:

['This', 'is', 'a', 'sample', 'text']
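
The shell-like behavior becomes apparent when the text contains quoted phrases. In the following extra snippet (an illustration, not part of the original example), the quoted phrase is kept together as a single token:

import shlex

text = 'This is a "sample text" with quotes'
tokens = shlex.split(text)
print(tokens)

This should print:

['This', 'is', 'a', 'sample text', 'with', 'quotes']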

Conclusion

In this article, we looked at different ways to perform tokenization in Python: the split() method, the NLTK library (with word_tokenize(), sent_tokenize(), and TweetTokenizer), regular expressions, and the shlex module.
