How can Tensorflow and Tensorflow text be used to tokenize string data?

Tensorflow text can be used to tokenize string data with the help of the ‘WhitespaceTokenizer’ which is a tokenizer that is created, after which the ‘tokenize’ method present in ‘WhitespaceTokenizer’ is called on the string.

Read More: What is TensorFlow and how Keras work with TensorFlow to create Neural Networks?

We will use the Keras Sequential API, which is helpful in building a sequential model that is used to work with a plain stack of layers, where every layer has exactly one input tensor and one output tensor.

A neural network that contains at least one layer is known as a convolutional layer. We can use the Convolutional Neural Network to build learning model. 

TensorFlow Text contains collection of text related classes and ops that can be used with TensorFlow 2.0. The TensorFlow Text can be used to preprocess sequence modelling.

We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.

Tokenization is the method of breaking down a string into tokens. These tokens can be words, numbers, or punctuation.

The important interfaces include Tokenizer and TokenizerWithOffsets each of which have a single method tokenize and tokenize_with_offsets respectively. There are multiple tokenizers, each of which implement TokenizerWithOffsets (which extends the Tokenizer class). This includes an option to get byte offsets into the original string. This helps know the bytes in the original string the token was created from.

All the tokenizers return RaggedTensors with the inner-most dimension of tokens that are mapped to the original individual strings. The resulting shape's rank increases by one. Following is an example:


print("Whitespace tokenizer is being called")
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print("The tokenized data is converted to a list")

Code credit −


Whitespace tokenizer is being called
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/ batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
The tokenized data is converted to a list
[[b'everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Sad\xe2\x98\xb9']]


  • The above code implements a basic tokenizer.

  • This tokenizer splits UTF-8 strings on ICU (International Components for Unicode) defined whitespace characters.