How can TensorFlow Text be used with the WhitespaceTokenizer in Python?
TensorFlow Text provides the WhitespaceTokenizer for splitting text based on whitespace characters. This tokenizer creates tokens by breaking strings at spaces, tabs, and newlines, making it useful for basic text preprocessing tasks.
Installing TensorFlow Text
First, install TensorFlow Text alongside TensorFlow:
pip install tensorflow-text
Basic WhitespaceTokenizer Usage
The WhitespaceTokenizer splits text at whitespace boundaries:
import tensorflow as tf
import tensorflow_text as text
print("Creating WhitespaceTokenizer")
tokenizer = text.WhitespaceTokenizer()
# Tokenize sample text
sample_text = ['Everything not saved will be lost.', 'Hello world']
tokens = tokenizer.tokenize(sample_text)
print("Tokens:")
print(tokens.to_list())
Creating WhitespaceTokenizer
Tokens:
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Hello', b'world']]
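For intuition, the tokenizer's behavior on a single string mirrors Python's built-in str.split(), which also splits on runs of spaces, tabs, and newlines. The sketch below is a plain-Python analogue, not the TensorFlow implementation (one visible difference: WhitespaceTokenizer yields byte strings, while str.split() yields str):

```python
# Plain-Python analogue of whitespace tokenization:
# str.split() with no argument splits on any run of spaces,
# tabs, and newlines, much like WhitespaceTokenizer does.
sample = 'Everything\tnot  saved\nwill be lost.'
tokens = sample.split()
print(tokens)  # ['Everything', 'not', 'saved', 'will', 'be', 'lost.']
```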
Working with N-grams
Tokenization is often combined with n-gram generation for sequence modeling. N-grams are sequential combinations of words produced with a sliding window:
import tensorflow as tf
import tensorflow_text as text
print("Whitespace tokenizer with n-grams")
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.', 'Machine learning rocks'])
print("Generating bigrams (n=2)")
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
print("Bigrams:")
print(bigrams.to_list())
Whitespace tokenizer with n-grams
Generating bigrams (n=2)
Bigrams:
[[b'Everything not', b'not saved', b'saved will', b'will be', b'be lost.'], [b'Machine learning', b'learning rocks']]
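Conceptually, the STRING_JOIN bigrams above come from joining each adjacent pair of tokens with a space. A plain-Python sketch of that sliding window (illustrative only; the real op works on ragged tensors):

```python
def bigrams_join(tokens):
    # Join each adjacent pair of tokens with a space, mimicking
    # text.ngrams(..., 2, reduction_type=Reduction.STRING_JOIN).
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams_join(['Machine', 'learning', 'rocks']))
# ['Machine learning', 'learning rocks']
```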
Key Features
| Feature | Description | Use Case |
|---|---|---|
| WhitespaceTokenizer | Splits on whitespace characters | Basic text preprocessing |
| N-gram generation | Creates sequential word combinations | Language modeling |
| Batch processing | Handles multiple strings at once | Efficient processing |
Reduction Types
TensorFlow Text supports different reduction mechanisms for combining tokens:
- STRING_JOIN: Concatenates strings with a separator (default: space)
- SUM: Sums numerical values
- MEAN: Calculates mean of numerical values
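To see how SUM and MEAN differ from STRING_JOIN, here is a small plain-Python sketch of the sliding-window reductions over numeric values (an illustration of the idea only; TensorFlow Text applies these reductions to ragged tensors):

```python
def ngram_reduce(values, n, reduce_fn):
    # Slide a window of width n over the values and
    # apply the given reduction to each window.
    return [reduce_fn(values[i:i + n]) for i in range(len(values) - n + 1)]

nums = [1.0, 2.0, 3.0, 4.0]
print(ngram_reduce(nums, 2, sum))                        # SUM-style:  [3.0, 5.0, 7.0]
print(ngram_reduce(nums, 2, lambda w: sum(w) / len(w)))  # MEAN-style: [1.5, 2.5, 3.5]
```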
Conclusion
TensorFlow Text's WhitespaceTokenizer provides an efficient way to split text into tokens based on whitespace. Combined with n-gram generation, it becomes a powerful tool for text preprocessing in machine learning pipelines.
