How can tf.text be used to see if a string has a certain property in Python?

The wordshape() function in TensorFlow Text (imported as tensorflow_text) can be used with a WordShape condition such as HAS_TITLE_CASE, IS_NUMERIC_VALUE, or HAS_SOME_PUNCT_OR_SYMBOL to check whether each token of a string has a particular property. This is useful for text preprocessing and natural language understanding tasks.

TensorFlow Text provides a collection of text-related classes and operations that work with TensorFlow 2.0. It includes tokenizers and word shape analysis functions that help identify specific patterns and properties in text data.

What is Word Shape Analysis?

Word shape analysis examines text tokens to identify common properties such as capitalization, numeric values, or punctuation. The wordshape() function uses regular expression-based helper functions to match these patterns in the input text.
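To make the idea of regular expression-based shape checks concrete, here is a minimal pure-Python sketch. The patterns and the matches_shape helper are illustrative assumptions: they are simplified, ASCII-oriented stand-ins, not the expressions TensorFlow Text uses internally.

```python
import re

# Simplified, ASCII-oriented stand-ins for a few word shapes.
# These patterns are illustrative assumptions, not the expressions
# TensorFlow Text uses internally.
SHAPE_PATTERNS = {
    "HAS_TITLE_CASE": re.compile(r"[A-Z][a-z]+.*"),
    "IS_UPPERCASE": re.compile(r"[A-Z]+"),
    "HAS_SOME_PUNCT_OR_SYMBOL": re.compile(r".*[^\w\s].*"),
    "IS_NUMERIC_VALUE": re.compile(r"[-+]?\d+(\.\d+)?"),
}


def matches_shape(token: str, shape: str) -> bool:
    """Return True when the whole token matches the named pattern."""
    return SHAPE_PATTERNS[shape].fullmatch(token) is not None


# Report which of the sketched shapes each sample token matches.
for token in ["Everything", "lost.", "NASA", "42"]:
    matched = [name for name in SHAPE_PATTERNS if matches_shape(token, name)]
    print(token, matched)
```

Each check is a full match against the whole token, which is why "lost." counts as containing punctuation but is not numeric or uppercase.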

Common Word Shape Properties

HAS_TITLE_CASE: Checks if the text starts with a capital letter
IS_UPPERCASE: Checks if all letters are uppercase
HAS_SOME_PUNCT_OR_SYMBOL: Checks for punctuation or symbols
IS_NUMERIC_VALUE: Checks if the token represents a number

Example

Here's how to use TensorFlow Text to analyze string properties:

import tensorflow as tf
import tensorflow_text as text

print("Whitespace tokenizer is being called")
tokenizer = text.WhitespaceTokenizer()

print("Tokens being generated")
tokens = tokenizer.tokenize(['Everything that is not saved will be lost.', 'Sad?'.encode('UTF-8')])

print("Checking if it is capitalized")
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)

print("Checking if all the letters are uppercase")
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)

print("Checking if the tokens contain punctuation")
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)

print("Checking if the token is a number")
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)

print("Printing the results")
# wordshape() returns a RaggedTensor here; to_list() converts it to nested Python lists
print("Title case:", f1.to_list())
print("Uppercase:", f2.to_list())
print("Has punctuation:", f3.to_list())
print("Is numeric:", f4.to_list())

Output

Whitespace tokenizer is being called
Tokens being generated
Checking if it is capitalized
Checking if all the letters are uppercase
Checking if the tokens contain punctuation
Checking if the token is a number
Printing the results
Title case: [[True, False, False, False, False, False, False, False], [True]]
Uppercase: [[False, False, False, False, False, False, False, False], [False]]
Has punctuation: [[False, False, False, False, False, False, False, True], [True]]
Is numeric: [[False, False, False, False, False, False, False, False], [False]]

How It Works

The process involves these steps:

Tokenization: WhitespaceTokenizer splits text into individual tokens
Property Analysis: wordshape() checks each token against the specified property
Boolean Results: Returns boolean arrays indicating which tokens match the property
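The three steps above can be sketched in plain Python, without TensorFlow, to show how one row of flags is produced per input sentence. The function names tokenize_on_whitespace, check_shape, and has_punct are hypothetical, and the punctuation test is a simplified assumption rather than the library's own rule.

```python
from typing import Callable, List


def tokenize_on_whitespace(sentences: List[str]) -> List[List[str]]:
    """Step 1: split each sentence into tokens on whitespace."""
    return [sentence.split() for sentence in sentences]


def check_shape(batch: List[List[str]],
                predicate: Callable[[str], bool]) -> List[List[bool]]:
    """Steps 2 and 3: test every token, keeping one row of flags per sentence."""
    return [[predicate(token) for token in row] for row in batch]


def has_punct(token: str) -> bool:
    """Hypothetical predicate: any character that is neither alphanumeric nor whitespace."""
    return any(not ch.isalnum() and not ch.isspace() for ch in token)


sentences = ["Everything that is not saved will be lost.", "Sad?"]
tokens = tokenize_on_whitespace(sentences)
flags = check_shape(tokens, has_punct)
print(flags)
```

The rows have different lengths (eight tokens and one token), which is the same nested shape that a ragged tensor represents in the TensorFlow version.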

Understanding the Results

For the sentence "Everything that is not saved will be lost.":

HAS_TITLE_CASE returns [True, False, False, False, False, False, False, False]: only the first token, "Everything", is capitalized
HAS_SOME_PUNCT_OR_SYMBOL returns [False, False, False, False, False, False, False, True]: only the last token, "lost.", contains punctuation (the period)
For "Sad?", both properties return True because the token is capitalized and contains the ? symbol
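One way to read the boolean rows is to pair each flag with its token and keep only the matches. The sketch below does this with plain Python lists; the flags are copied from the printed output earlier in this article, and select_tokens is a hypothetical helper name.

```python
# Tokens as produced by whitespace splitting of the two example inputs.
tokens = [
    ["Everything", "that", "is", "not", "saved", "will", "be", "lost."],
    ["Sad?"],
]

# Flags copied from the "Title case" and "Has punctuation" rows of the output.
title_case = [[True, False, False, False, False, False, False, False], [True]]
has_punct = [[False, False, False, False, False, False, False, True], [True]]


def select_tokens(token_rows, flag_rows):
    """Keep the tokens whose flag is True, one row per sentence."""
    return [
        [tok for tok, flag in zip(row, row_flags) if flag]
        for row, row_flags in zip(token_rows, flag_rows)
    ]


print(select_tokens(tokens, title_case))  # [['Everything'], ['Sad?']]
print(select_tokens(tokens, has_punct))   # [['lost.'], ['Sad?']]
```

When working with the ragged tensors directly, tf.ragged.boolean_mask should serve a similar purpose, assuming the mask has the same row structure as the tokens.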

Conclusion

TensorFlow Text's wordshape() function provides an efficient way to analyze text properties for natural language processing tasks. Use it to identify patterns like capitalization, punctuation, and numeric values in tokenized text.

AmitDiwan

Updated on: 2026-03-25T16:36:07+05:30
