How to encode multiple strings that have the same length using TensorFlow and Python?


Multiple strings of the same length can be encoded by passing a ‘tf.Tensor’ of Unicode code points as the input value. When multiple strings of varying lengths need to be encoded, a tf.RaggedTensor should be used as the input. If a tensor contains multiple strings in padded or sparse format, it needs to be converted to a tf.RaggedTensor first. Then, the method unicode_encode should be called on it.
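To see why a padded batch has to be converted before encoding, here is a minimal pure-Python sketch of the idea. It is not TensorFlow's implementation; the helper names `strip_trailing_padding` and `encode_row` and the sample rows are hypothetical and chosen only for illustration. It assumes the padding value is -1, as in the code further down.

```python
PAD = -1  # assumed padding value, matching padding=-1 used later in this article

def strip_trailing_padding(row, pad=PAD):
    """Drop the run of padding values at the end of a row (hypothetical helper)."""
    end = len(row)
    while end > 0 and row[end - 1] == pad:
        end -= 1
    return row[:end]

def encode_row(code_points):
    """Turn a list of Unicode code points into UTF-8 bytes (hypothetical helper)."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

# Sample padded batch (illustrative data): "cat", "horse", "ox"
padded = [
    [99, 97, 116, -1, -1],
    [104, 111, 114, 115, 101],
    [111, 120, -1, -1, -1],
]

# Removing the padding yields rows of different lengths, i.e. a ragged batch.
ragged = [strip_trailing_padding(row) for row in padded]
encoded = [encode_row(row) for row in ragged]

print(ragged)
print(encoded)
```

The padding value is not a valid code point, so it has to be removed before the rows are encoded. Once it is removed, the rows no longer share one length, which is why a ragged container is the natural input.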


Let us understand how to represent Unicode strings using Python, and manipulate them using the Unicode equivalents of the standard string ops. Here, each string is represented as a sequence of Unicode code points (integers), and tf.strings.unicode_encode converts those code points back into encoded strings.
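The integer lists passed to unicode_encode in this article are Unicode code points. As a small illustration in plain Python (the variable names below are assumptions for this sketch, not part of the tutorial), the built-in `ord` and `chr` functions show where values such as 99, 97 and 116 come from:

```python
# Three words of the same length, so their code points form a rectangular batch.
words = ["cat", "dog", "cow"]

# ord() maps a character to its Unicode code point.
code_points = [[ord(ch) for ch in word] for word in words]
print(code_points)

# chr() reverses the mapping; encoding to UTF-8 gives the byte strings.
round_trip = ["".join(chr(cp) for cp in row).encode("utf-8") for row in code_points]
print(round_trip)
```

Because all three rows have the same length, the batch fits in a regular rectangular tensor.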

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

import tensorflow as tf

# Sample inputs, assumed here for illustration so that the snippet runs on its own:
# code points for "cat", "horse" and "ox" in ragged, padded and sparse form.
batch_chars_ragged = tf.ragged.constant([[99, 97, 116], [104, 111, 114, 115, 101], [111, 120]])
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
batch_chars_sparse = batch_chars_ragged.to_sparse()

print("When encoding multiple strings of the same length, tf.Tensor is used as input")
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]], output_encoding='UTF-8')
print("When encoding multiple strings with varying length, a tf.RaggedTensor should be used as input:")
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
print("If there is a tensor with multiple strings in padded/sparse format, convert it to a tf.RaggedTensor before calling unicode_encode")
tf.strings.unicode_encode(
   tf.RaggedTensor.from_sparse(batch_chars_sparse),
   output_encoding='UTF-8')
tf.strings.unicode_encode(
   tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
   output_encoding='UTF-8')

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

When encoding multiple strings of the same length, tf.Tensor is used as input
When encoding multiple strings with varying length, a tf.RaggedTensor should be used as input:
If there is a tensor with multiple strings in padded/sparse format, convert it to a tf.RaggedTensor before calling unicode_encode

Explanation

  • When encoding multiple strings of the same length, a tf.Tensor can be used as input.
  • When encoding multiple strings of varying lengths, a tf.RaggedTensor should be used as input.
  • When a tensor holds multiple strings in padded or sparse format, it needs to be converted to a tf.RaggedTensor before unicode_encode is called on it.
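The three cases above can be checked against each other. The sketch below assumes TensorFlow 2.x with eager execution and uses made-up sample data (the variable names are illustrative). It encodes the same batch from its ragged, padded and sparse forms and collects the results as Python byte strings for comparison.

```python
import tensorflow as tf

# Sample code points, assumed for illustration: "cat", "horse", "ox".
chars_ragged = tf.ragged.constant([[99, 97, 116], [104, 111, 114, 115, 101], [111, 120]])
chars_padded = chars_ragged.to_tensor(default_value=-1)  # dense, padded with -1
chars_sparse = chars_ragged.to_sparse()                  # tf.SparseTensor

# Ragged input can be encoded directly.
from_ragged = tf.strings.unicode_encode(chars_ragged, output_encoding='UTF-8')

# Padded and sparse inputs are converted to a tf.RaggedTensor first.
from_padded = tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(chars_padded, padding=-1), output_encoding='UTF-8')
from_sparse = tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(chars_sparse), output_encoding='UTF-8')

# Each result is a 1-D string tensor; convert to lists of bytes to compare them.
results = [t.numpy().tolist() for t in (from_ragged, from_padded, from_sparse)]
for name, value in zip(("ragged", "padded", "sparse"), results):
    print(name, value)
```

All three paths should yield the same list of encoded strings, which shows that the padded and sparse forms are just alternative layouts of the same ragged data.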

Updated on: 20-Feb-2021
