How can TensorFlow Text be used with the WhitespaceTokenizer in Python?
TensorFlow Text provides the WhitespaceTokenizer for splitting text based on whitespace characters. This tokenizer creates tokens by breaking strings at spaces, tabs, and newlines, making it useful for basic text preprocessing tasks.
Installing TensorFlow Text
First, install TensorFlow Text alongside TensorFlow:
pip install tensorflow-text
Basic WhitespaceTokenizer Usage
The WhitespaceTokenizer splits text at whitespace boundaries:
import tensorflow as tf
import tensorflow_text as text
print("Creating WhitespaceTokenizer")
tokenizer = text.WhitespaceTokenizer()
# Tokenize sample text
sample_text = ['Everything not saved will be lost.', 'Hello world']
tokens = tokenizer.tokenize(sample_text)
print("Tokens:")
print(tokens.to_list())
Creating WhitespaceTokenizer
Tokens:
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Hello', b'world']]
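For intuition, the tokenizer's behavior on a single string mirrors Python's built-in str.split(), which also splits on runs of spaces, tabs, and newlines. The sketch below is a plain-Python analogue, not the TensorFlow implementation (one visible difference: WhitespaceTokenizer yields byte strings, while str.split() yields str):

```python
# Plain-Python analogue of whitespace tokenization:
# str.split() with no argument splits on any run of spaces,
# tabs, and newlines, much like WhitespaceTokenizer does.
sample = 'Everything\tnot  saved\nwill be lost.'
tokens = sample.split()
print(tokens)  # ['Everything', 'not', 'saved', 'will', 'be', 'lost.']
```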
Working with N-grams
Tokenization is often combined with n-gram generation for sequence modeling. N-grams are sequential combinations of words produced with a sliding window:
import tensorflow as tf
import tensorflow_text as text
print("Whitespace tokenizer with n-grams")
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.', 'Machine learning rocks'])
print("Generating bigrams (n=2)")
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
print("Bigrams:")
print(bigrams.to_list())
Whitespace tokenizer with n-grams
Generating bigrams (n=2)
Bigrams:
[[b'Everything not', b'not saved', b'saved will', b'will be', b'be lost.'], [b'Machine learning', b'learning rocks']]
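Conceptually, the STRING_JOIN bigrams above come from joining each adjacent pair of tokens with a space. A plain-Python sketch of that sliding window (illustrative only; the real op works on ragged tensors):

```python
def bigrams_join(tokens):
    # Join each adjacent pair of tokens with a space, mimicking
    # text.ngrams(..., 2, reduction_type=Reduction.STRING_JOIN).
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams_join(['Machine', 'learning', 'rocks']))
# ['Machine learning', 'learning rocks']
```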
Key Features
| Feature | Description | Use Case |
|---|---|---|
| WhitespaceTokenizer | Splits on whitespace characters | Basic text preprocessing |
| N-gram generation | Creates sequential word combinations | Language modeling |
| Batch processing | Handles multiple strings at once | Efficient processing |
Reduction Types
TensorFlow Text supports different reduction mechanisms for combining tokens:
- STRING_JOIN: Concatenates strings with a separator (default: space)
- SUM: Sums numerical values
- MEAN: Calculates mean of numerical values
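To see how SUM and MEAN differ from STRING_JOIN, here is a small plain-Python sketch of the sliding-window reductions over numeric values (an illustration of the idea only; TensorFlow Text applies these reductions to ragged tensors):

```python
def ngram_reduce(values, n, reduce_fn):
    # Slide a window of width n over the values and
    # apply the given reduction to each window.
    return [reduce_fn(values[i:i + n]) for i in range(len(values) - n + 1)]

nums = [1.0, 2.0, 3.0, 4.0]
print(ngram_reduce(nums, 2, sum))                        # SUM-style:  [3.0, 5.0, 7.0]
print(ngram_reduce(nums, 2, lambda w: sum(w) / len(w)))  # MEAN-style: [1.5, 2.5, 3.5]
```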
Conclusion
TensorFlow Text's WhitespaceTokenizer provides an efficient way to split text into tokens based on whitespace. Combined with n-gram generation, it becomes a powerful tool for text preprocessing in machine learning pipelines.
