How can tf.text be used to see if a string has a certain property in Python?
The wordshape() function in TensorFlow Text (the tensorflow_text package, often written tf.text) can be used with WordShape conditions such as HAS_TITLE_CASE, IS_NUMERIC_VALUE, or HAS_SOME_PUNCT_OR_SYMBOL to check whether each token in a string has a particular property. Such checks are a common feature in text preprocessing and natural language understanding models.
TensorFlow Text provides a collection of text-related classes and operations that work with TensorFlow 2.0. It includes tokenizers and word shape functions that help identify specific patterns and properties in text data.
What is Word Shape Analysis?
Word shape analysis examines text tokens to identify common properties such as capitalization, numeric values, or punctuation. The wordshape() function uses regular-expression-based helper patterns to match these properties in your input text.
Common Word Shape Properties
- HAS_TITLE_CASE: Checks if the token starts with a capital letter
- IS_UPPERCASE: Checks if all letters are uppercase
- HAS_SOME_PUNCT_OR_SYMBOL: Checks for punctuation or symbols
- IS_NUMERIC_VALUE: Checks if the token represents a number
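To make the four properties above concrete, here is a minimal plain-Python sketch of what such checks might look like. The function names and the rules inside them are illustrative assumptions, not the patterns TensorFlow Text actually uses, and edge cases (for example non-Latin scripts) may behave differently in the library.

```python
import re

# Plain-Python stand-ins for four word-shape checks. These are illustrative
# approximations (assumptions), not TensorFlow Text's real patterns.

def has_title_case(token: str) -> bool:
    # First character is uppercase and no later character is uppercase.
    return bool(token) and token[0].isupper() and not any(c.isupper() for c in token[1:])

def is_uppercase(token: str) -> bool:
    # str.isupper() needs at least one cased character, all of them uppercase.
    return token.isupper()

def has_some_punct_or_symbol(token: str) -> bool:
    # Any character that is neither alphanumeric nor whitespace.
    return any(not c.isalnum() and not c.isspace() for c in token)

# Optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")

def is_numeric_value(token: str) -> bool:
    return _NUMBER.fullmatch(token) is not None

for token in ["Everything", "NASA", "lost.", "3.14"]:
    print(token, has_title_case(token), is_uppercase(token),
          has_some_punct_or_symbol(token), is_numeric_value(token))
```

Note that a token can satisfy several properties at once: in this sketch "3.14" counts as numeric and also as containing punctuation.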
Example
Here's how to use tf.text to analyze string properties:
import tensorflow as tf
import tensorflow_text as text
print("Whitespace tokenizer is being called")
tokenizer = text.WhitespaceTokenizer()
print("Tokens being generated")
tokens = tokenizer.tokenize(['Everything that is not saved will be lost.', 'Sad?'.encode('UTF-8')])
print("Checking if it is capitalized")
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
print("Checking if all the letters are uppercase")
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
print("Checking if the tokens contain punctuation")
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
print("Checking if the token is a number")
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)
print("Printing the results")
# The results are ragged tensors; to_list() converts them to nested Python lists
print("Title case:", f1.to_list())
print("Uppercase:", f2.to_list())
print("Has punctuation:", f3.to_list())
print("Is numeric:", f4.to_list())
Output:
Whitespace tokenizer is being called
Tokens being generated
Checking if it is capitalized
Checking if all the letters are uppercase
Checking if the tokens contain punctuation
Checking if the token is a number
Printing the results
Title case: [[True, False, False, False, False, False, False, False], [True]]
Uppercase: [[False, False, False, False, False, False, False, False], [False]]
Has punctuation: [[False, False, False, False, False, False, False, True], [True]]
Is numeric: [[False, False, False, False, False, False, False, False], [False]]
How It Works
The process involves these steps:
- Tokenization: WhitespaceTokenizer splits each input string into individual tokens
- Property Analysis: wordshape() checks each token against the specified property
- Boolean Results: Returns boolean values, one per token and nested the same way as the tokens, indicating which tokens match the property
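The three steps above can be imitated in plain Python to show how the nesting of the result follows the nesting of the tokens. This is only a sketch: whitespace_tokenize, apply_shape, and contains_punct are hypothetical helpers written for illustration, and contains_punct is a rough stand-in for HAS_SOME_PUNCT_OR_SYMBOL rather than the library's own rule.

```python
from typing import Callable, List

def whitespace_tokenize(batch: List[str]) -> List[List[str]]:
    # Step 1: one list of tokens per input string (rows may differ in length).
    return [sentence.split() for sentence in batch]

def apply_shape(tokens: List[List[str]],
                predicate: Callable[[str], bool]) -> List[List[bool]]:
    # Steps 2 and 3: test every token and keep the nesting of the input.
    return [[predicate(tok) for tok in row] for row in tokens]

def contains_punct(token: str) -> bool:
    # Hypothetical stand-in for HAS_SOME_PUNCT_OR_SYMBOL (an assumption).
    return any(not c.isalnum() for c in token)

batch = ["Everything that is not saved will be lost.", "Sad?"]
tokens = whitespace_tokenize(batch)
flags = apply_shape(tokens, contains_punct)

print(tokens)
print(flags)
```

Each inner list of flags has exactly as many entries as the matching list of tokens, which is why the two inputs of different length produce rows of different length.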
Understanding the Results
For the sentence "Everything that is not saved will be lost.":
- HAS_TITLE_CASE returns [True, False, False, False, False, False, False, False]: only "Everything" is capitalized
- HAS_SOME_PUNCT_OR_SYMBOL returns [False, False, False, False, False, False, False, True]: only the last token, "lost.", contains punctuation, because the whitespace tokenizer keeps the period attached to the word
- For "Sad?", both properties return True because it's capitalized and contains the ? symbol
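One way to read these results is to use each boolean list as a mask over the tokens. The sketch below does this in plain Python with the tokens and results copied from the example output; select and both are hypothetical helper names introduced only for this illustration.

```python
from typing import List

# Tokens and boolean results copied from the example output above.
tokens = [["Everything", "that", "is", "not", "saved", "will", "be", "lost."],
          ["Sad?"]]
title_case = [[True] + [False] * 7, [True]]
has_punct = [[False] * 7 + [True], [True]]

def select(tokens: List[List[str]], mask: List[List[bool]]) -> List[List[str]]:
    # Keep only the tokens whose mask entry is True, row by row.
    return [[tok for tok, keep in zip(row, flags) if keep]
            for row, flags in zip(tokens, mask)]

def both(a: List[List[bool]], b: List[List[bool]]) -> List[List[bool]]:
    # Element-wise AND of two masks with the same nesting.
    return [[x and y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

print(select(tokens, title_case))                   # [['Everything'], ['Sad?']]
print(select(tokens, has_punct))                    # [['lost.'], ['Sad?']]
print(select(tokens, both(title_case, has_punct)))  # [[], ['Sad?']]
```

When working with the ragged tensors themselves rather than Python lists, tf.ragged.boolean_mask plays a similar role.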
Conclusion
TensorFlow Text's wordshape() function provides an efficient way to analyze text properties for natural language processing tasks. Use it to identify patterns like capitalization, punctuation, and numeric values in tokenized text.
