How can tf.text be used to see if a string has a certain property in Python?


The ‘wordshape’ method can be used along with specific conditions such as ‘HAS_TITLE_CASE’, ‘IS_NUMERIC_VALUE’, or ‘HAS_SOME_PUNCT_OR_SYMBOL’ to see if a string has a particular property.

Read More: What is TensorFlow and how Keras work with TensorFlow to create Neural Networks?

We will use the Keras Sequential API, which is helpful in building a sequential model that is used to work with a plain stack of layers, where every layer has exactly one input tensor and one output tensor.

A neural network that contains at least one layer is known as a convolutional layer. We can use the Convolutional Neural Network to build learning model. 

TensorFlow Text contains collection of text related classes and ops that can be used with TensorFlow 2.0. The TensorFlow Text can be used to preprocess sequence modelling.

We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.

Tokenization is the method of breaking down a string into tokens. These tokens can be words, numbers, or punctuation. The key interfaces include Tokenizer and TokenizerWithOffsets each of which have a single method tokenize and tokenize_with_offsets respectively. There are multiple tokenizers, each of which implement TokenizerWithOffsets (which extends the Tokenizer class). This includes an option to get byte offsets into the original string. This helps know the bytes in the original string the token was created from.

A common feature used in certain natural language understanding models is to see if text string has a specific property. Wordshape defines a variety of useful regular expression based helper functions for matching various relevant patterns in your input text. Here are a few examples.

Example

print("Whitespace tokenizer is being called")
tokenizer = text.WhitespaceTokenizer()
print("Tokens being generated")
tokens = tokenizer.tokenize(['Everything that is not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print("Checking if it is capitalized")
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
print("Checking if all the letters are uppercase")
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
print("Checking if the tokens contain punctuation")
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
print("Checking if the token is a number")
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)
print("Printing the results")
print(f1.to_list())
print(f2.to_list())
print(f3.to_list())
print(f4.to_list())

Code credit −https://www.tensorflow.org/tutorials/tensorflow_text/intro

Output

Whitespace tokenizer is being called
Tokens being generated
Checking if it is capitalized
Checking if all the letters are uppercase
Checking if the tokens contain punctuation
Checking if the token is a number
Printing the results
[[True, False, False, False, False, False, False, False], [True]]
[[False, False, False, False, False, False, False, False], [False]]
[[False, False, False, False, False, False, False, True], [True]]
[[False, False, False, False, False, False, False, False], [False]]

Explanation

  • The ‘WhitespaceTokenizer’ is called, and tokens are generated.
  • The letters are checked to see if they are uppercase.
  • It is also checked for punctuation and whether it is a number or not.
  • After these computation, bool values are displayed

Updated on: 22-Feb-2021

132 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements