How can tf.text be used to see if a string has a certain property in Python?

The wordshape() function in TensorFlow Text (the tensorflow_text package, often abbreviated tf.text) can be combined with a WordShape condition such as HAS_TITLE_CASE, IS_NUMERIC_VALUE, or HAS_SOME_PUNCT_OR_SYMBOL to check whether each token in a string has a particular property. This is useful for text preprocessing and natural language understanding tasks.

TensorFlow Text provides a collection of text-related classes and operations that work with TensorFlow 2.0. It includes tokenizers and word shape analysis functions that help identify specific patterns and properties in text data.

What is Word Shape Analysis?

Word shape analysis examines text tokens to identify common properties like capitalization, numeric values, or punctuation. The wordshape() function in tensorflow_text uses regular expression-based helper functions to match various patterns in your input text, and it returns one boolean per token.

Common Word Shape Properties

  • HAS_TITLE_CASE: Checks if the token is title-cased (for example, "Hello")
  • IS_UPPERCASE: Checks if all letters are uppercase
  • HAS_SOME_PUNCT_OR_SYMBOL: Checks if the token contains punctuation or symbols
  • IS_NUMERIC_VALUE: Checks if the token represents a number
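
To make the idea of regular expression-based shape checks concrete, here is a minimal sketch in plain Python. The patterns and the approx_wordshape helper are simplified, ASCII-only assumptions made for illustration; they are not the patterns that tensorflow_text uses internally, and they will disagree with the library on many inputs (non-ASCII text in particular).

```python
import re

# Simplified, ASCII-only stand-ins for four word shape checks.
# ASSUMPTION: these patterns are illustrative only and are NOT the
# regular expressions used inside tensorflow_text.
APPROX_SHAPES = {
    # One capital letter, then lowercase letters, then optional non-letters.
    "HAS_TITLE_CASE": re.compile(r"[A-Z][a-z]+[^A-Za-z]*"),
    # Only capital letters.
    "IS_UPPERCASE": re.compile(r"[A-Z]+"),
    # At least one character that is neither a word character nor whitespace.
    "HAS_SOME_PUNCT_OR_SYMBOL": re.compile(r".*[^\w\s].*"),
    # An optionally signed integer or decimal number.
    "IS_NUMERIC_VALUE": re.compile(r"[-+]?\d+(\.\d+)?"),
}


def approx_wordshape(token, shape_name):
    """Return True if the whole token matches the simplified pattern."""
    return APPROX_SHAPES[shape_name].fullmatch(token) is not None


for tok in ["Everything", "lost.", "NASA", "42"]:
    matched = [name for name in APPROX_SHAPES if approx_wordshape(tok, name)]
    print(tok, matched)
```

Each check is a whole-token match, which is why a token such as "lost." fails the title-case check but passes the punctuation check.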

Example

Here's how to use tf.text to analyze string properties:

import tensorflow as tf
import tensorflow_text as text

print("Whitespace tokenizer is being called")
tokenizer = text.WhitespaceTokenizer()

print("Tokens being generated")
tokens = tokenizer.tokenize(['Everything that is not saved will be lost.', 'Sad?'.encode('UTF-8')])

print("Checking if it is capitalized")
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)

print("Checking if all the letters are uppercase")
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)

print("Checking if the tokens contain punctuation")
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)

print("Checking if the token is a number")
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)

print("Printing the results")
# wordshape() returns a RaggedTensor; to_list() converts it to nested Python lists
print("Title case:", f1.to_list())
print("Uppercase:", f2.to_list())
print("Has punctuation:", f3.to_list())
print("Is numeric:", f4.to_list())

Running the code prints the following output:

Whitespace tokenizer is being called
Tokens being generated
Checking if it is capitalized
Checking if all the letters are uppercase
Checking if the tokens contain punctuation
Checking if the token is a number
Printing the results
Title case: [[True, False, False, False, False, False, False, False], [True]]
Uppercase: [[False, False, False, False, False, False, False, False], [False]]
Has punctuation: [[False, False, False, False, False, False, False, True], [True]]
Is numeric: [[False, False, False, False, False, False, False, False], [False]]

How It Works

The process involves these steps:

  1. Tokenization: WhitespaceTokenizer splits text into individual tokens
  2. Property Analysis: wordshape() checks each token against the specified property
  3. Boolean Results: Returns boolean arrays indicating which tokens match the property
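
The three steps can be sketched in plain Python without TensorFlow. In this sketch, str.split() stands in for WhitespaceTokenizer and a caller-supplied predicate stands in for a WordShape pattern; the shape_flags helper is a hypothetical name introduced here, not part of tensorflow_text.

```python
def shape_flags(sentences, predicate):
    """Tokenize each sentence on whitespace and test every token.

    ASSUMPTION: shape_flags is a hypothetical helper for illustration;
    str.split() only approximates what WhitespaceTokenizer does.
    """
    # Step 1: whitespace tokenization, one token list per sentence.
    tokenized = [sentence.split() for sentence in sentences]
    # Steps 2 and 3: apply the check to each token and keep one boolean
    # per token, so the result has the same nested shape as the tokens.
    return [[predicate(tok) for tok in toks] for toks in tokenized]


sentences = ["Everything that is not saved will be lost.", "Sad?"]

# A rough stand-in for a capitalization check: does the token start uppercase?
starts_upper = shape_flags(sentences, lambda tok: tok[:1].isupper())
print(starts_upper)
# [[True, False, False, False, False, False, False, False], [True]]
```

The result is nested rather than rectangular because the two sentences have different token counts, which is the same reason the library returns ragged results.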

Understanding the Results

For the sentence "Everything that is not saved will be lost.":

  • HAS_TITLE_CASE returns [True, False, False, False, False, False, False, False]: only "Everything" is capitalized
  • HAS_SOME_PUNCT_OR_SYMBOL returns [False, False, False, False, False, False, False, True]: only the last token, "lost.", contains punctuation, because the whitespace tokenizer leaves the period attached to the word
  • For "Sad?", both properties return True because the token is capitalized and contains the ? symbol
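
One way to read these results is to pair each boolean with its token. The sketch below does that in plain Python, using the HAS_SOME_PUNCT_OR_SYMBOL flags copied from the output shown earlier; select_tokens is a hypothetical helper introduced for illustration. With the RaggedTensor values themselves, tf.ragged.boolean_mask would be the more natural tool, which is not shown here.

```python
# Tokens as a whitespace split would produce them.
tokens = [
    "Everything that is not saved will be lost.".split(),
    "Sad?".split(),
]

# Flags copied from the HAS_SOME_PUNCT_OR_SYMBOL output shown earlier.
has_punct = [
    [False, False, False, False, False, False, False, True],
    [True],
]


def select_tokens(token_rows, flag_rows):
    """Keep only the tokens whose flag is True (hypothetical helper)."""
    return [
        [tok for tok, flag in zip(toks, flags) if flag]
        for toks, flags in zip(token_rows, flag_rows)
    ]


print(select_tokens(tokens, has_punct))
# [['lost.'], ['Sad?']]
```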

Conclusion

TensorFlow Text's wordshape() function provides an efficient way to analyze text properties for natural language processing tasks. Use it to identify patterns like capitalization, punctuation, and numeric values in tokenized text.

Updated on: 2026-03-25T16:36:07+05:30
