How can TensorFlow Text be used with the WhitespaceTokenizer in Python?

TensorFlow Text provides the WhitespaceTokenizer for splitting text based on whitespace characters. This tokenizer creates tokens by breaking strings at spaces, tabs, and newlines, making it useful for basic text preprocessing tasks.
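Before reaching for TensorFlow Text, it helps to see what whitespace splitting means in plain Python. This sketch only illustrates the concept; it is not the TF Text API:

```python
# Plain-Python analogue of whitespace tokenization:
# str.split() with no arguments splits on runs of spaces, tabs, and newlines.
sentence = "Everything not\tsaved will\nbe lost."
tokens = sentence.split()
print(tokens)  # ['Everything', 'not', 'saved', 'will', 'be', 'lost.']
```

The TF Text tokenizer performs the same kind of splitting, but it operates on batches of strings and returns byte tokens in a ragged tensor.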


Installing TensorFlow Text

First, install TensorFlow Text alongside TensorFlow:

pip install tensorflow-text

Basic WhitespaceTokenizer Usage

The WhitespaceTokenizer splits text at whitespace boundaries:

import tensorflow as tf
import tensorflow_text as text

print("Creating WhitespaceTokenizer")
tokenizer = text.WhitespaceTokenizer()

# Tokenize sample text
sample_text = ['Everything not saved will be lost.', 'Hello world']
tokens = tokenizer.tokenize(sample_text)

print("Tokens:")
print(tokens.to_list())

Output:
Creating WhitespaceTokenizer
Tokens:
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Hello', b'world']]

Working with N-grams

Tokenization is often combined with n-gram generation for sequence modeling. An n-gram is a contiguous sequence of n tokens produced by sliding a window over the token list:

import tensorflow as tf
import tensorflow_text as text

print("Whitespace tokenizer with n-grams")
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.', 'Machine learning rocks'])

print("Generating bigrams (n=2)")
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
print("Bigrams:")
print(bigrams.to_list())

Output:
Whitespace tokenizer with n-grams
Generating bigrams (n=2)
Bigrams:
[[b'Everything not', b'not saved', b'saved will', b'will be', b'be lost.'], [b'Machine learning', b'learning rocks']]
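The sliding-window idea behind STRING_JOIN bigrams can be sketched in plain Python. This is an illustrative helper, not part of TensorFlow Text:

```python
def string_join_ngrams(tokens, n, sep=" "):
    # Slide a window of width n over the tokens and join each window with sep.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Everything", "not", "saved", "will", "be", "lost."]
print(string_join_ngrams(tokens, 2))
# ['Everything not', 'not saved', 'saved will', 'will be', 'be lost.']
```

Note that a list of k tokens yields k - n + 1 n-grams, which is why the six tokens above produce five bigrams.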

Key Features

| Feature | Description | Use Case |
| --- | --- | --- |
| WhitespaceTokenizer | Splits on whitespace characters | Basic text preprocessing |
| N-gram generation | Creates sequential word combinations | Language modeling |
| Batch processing | Handles multiple strings at once | Efficient processing |

Reduction Types

TensorFlow Text supports several reduction types for combining the tokens in each n-gram window:

  • STRING_JOIN: Concatenates strings with a separator (default: space)
  • SUM: Sums numerical values
  • MEAN: Calculates mean of numerical values
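The three reduction types can be mimicked in plain Python to show how each one combines a window of values. This is a hypothetical helper for illustration; TensorFlow Text applies these reductions natively on ragged tensors:

```python
def reduce_ngrams(values, width, reduction, sep=" "):
    # Build all sliding windows of the given width.
    windows = [values[i:i + width] for i in range(len(values) - width + 1)]
    if reduction == "STRING_JOIN":   # join strings with a separator
        return [sep.join(w) for w in windows]
    if reduction == "SUM":           # sum numeric values in each window
        return [sum(w) for w in windows]
    if reduction == "MEAN":          # average numeric values in each window
        return [sum(w) / width for w in windows]
    raise ValueError(f"unknown reduction: {reduction}")

print(reduce_ngrams(["to", "be", "or", "not"], 2, "STRING_JOIN"))
# ['to be', 'be or', 'or not']
print(reduce_ngrams([1.0, 2.0, 3.0], 2, "SUM"))   # [3.0, 5.0]
print(reduce_ngrams([1.0, 2.0, 3.0], 2, "MEAN"))  # [1.5, 2.5]
```

STRING_JOIN is the reduction used in the bigram example above; SUM and MEAN are meant for numeric input rather than token strings.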

Conclusion

TensorFlow Text's WhitespaceTokenizer provides an efficient way to split text into tokens based on whitespace. Combined with n-gram generation, it becomes a powerful tool for text preprocessing in machine learning pipelines.

Updated on: 2026-03-25T16:36:26+05:30
