How to represent Unicode strings as UTF-8 encoded strings using TensorFlow and Python?


A set of Unicode strings can be represented as UTF-8 encoded strings by calling Python's ‘encode’ method on each string.


Models that process natural language must handle many languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is assigned a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
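As a small illustration of code points versus UTF-8 bytes, the sketch below uses only built-in Python (no TensorFlow). The sample string is one of the strings used later in this article; the variable names are chosen here for illustration only.

```python
# One of the sample strings from this article, written with escapes
# so the code points are unambiguous: "Göödnight".
text = "G\u00f6\u00f6dnight"

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in text]

# str.encode() turns the string into UTF-8 bytes.
utf8_bytes = text.encode("UTF-8")

print(code_points)      # one integer per character
print(len(text))        # number of code points in the string
print(len(utf8_bytes))  # number of bytes; 'ö' needs two bytes in UTF-8

# Decoding the bytes recovers the original string.
round_trip = utf8_bytes.decode("UTF-8")
```

The number of bytes can be larger than the number of code points, because UTF-8 uses between one and four bytes per code point.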

Let us understand how to represent Unicode strings using Python and manipulate them with the Unicode equivalents of standard string ops. In the code below, a batch of Unicode strings is encoded as UTF-8, decoded into code points with tf.strings.unicode_decode, and then converted to a padded dense tensor and to a sparse tensor.

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

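import tensorflow as tf

print("A set of Unicode strings which is represented as a UTF8-encoded string")
# Encode each Unicode string to UTF-8 bytes.
batch_utf8 = [s.encode('UTF-8') for s in [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
# Decode the batch into code points; the rows differ in length, so the result is a tf.RaggedTensor.
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)
print("Dense tensor with padding are printed")
# Pad the shorter rows with -1 to obtain a dense tf.Tensor.
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())
print("Converting to sparse matrix")
# Convert to a tf.SparseTensor (the result is not printed here).
batch_chars_sparse = batch_chars_ragged.to_sparse()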

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

A set of Unicode strings which is represented as a UTF8-encoded string
[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]
Dense tensor with padding are printed
[[ 104      195      108      108      111       -1       -1       -1       -1       -1
   -1       -1       -1       -1       -1       -1       -1       -1       -1       -1
   -1       -1       -1       -1       -1       -1       -1       -1]
[87      104       97      116       32      105      115       32      116      104
 101       32      119      101       97      116      104      101      114       32
 116      111      109      111      114      114      111      119]
[71      246      246      100      110      105      103      104      116       -1
   -1       -1       -1       -1       -1       -1       -1       -1       -1       -1
   -1       -1       -1       -1       -1       -1       -1       -1]
[128522       -1       -1       -1       -1       -1       -1       -1       -1       -1
   -1       -1       -1       -1       -1       -1       -1       -1       -1       -1
   -1       -1       -1       -1       -1       -1       -1       -1]]
Converting to sparse matrix

Explanation

  • When multiple strings are decoded, the number of characters in every string may not be equal.
  • The result is a tf.RaggedTensor, in which the length of the innermost dimension varies with the number of characters in each string.
  • This tf.RaggedTensor can be used directly, or it can be converted into a dense tf.Tensor with padding using tf.RaggedTensor.to_tensor, or into a tf.SparseTensor using tf.RaggedTensor.to_sparse.
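To make the ragged, padded, and sparse forms concrete, here is a minimal plain-Python sketch that does not use TensorFlow. The helpers pad_rows and to_coordinates are hypothetical names introduced only for this illustration; they mimic the idea behind padding and coordinate-style sparse storage and are not TensorFlow's implementation.

```python
# Plain-Python illustration (no TensorFlow) of padding ragged rows and of
# the coordinate form (indices, values, shape) used by sparse storage.
# Strings written with escapes: "hÃllo", "Göödnight", and the smiling emoji.
strings = ["h\u00c3llo", "G\u00f6\u00f6dnight", "\U0001F60A"]

# Ragged form: one list of code points per string, with differing lengths.
ragged = [[ord(ch) for ch in s] for s in strings]


def pad_rows(rows, default_value=-1):
    """Pad every row with default_value up to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [row + [default_value] * (width - len(row)) for row in rows]


def to_coordinates(rows):
    """Return (indices, values, dense_shape) for the entries that exist."""
    indices = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    values = [value for row in rows for value in row]
    width = max((len(row) for row in rows), default=0)
    return indices, values, (len(rows), width)


padded = pad_rows(ragged)
indices, values, dense_shape = to_coordinates(ragged)

for row in padded:
    print(row)
print(dense_shape)
```

The padded form stores a filler value for every missing position, while the coordinate form stores only the positions that actually hold a code point.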

Updated on: 19-Feb-2021
