How can Unicode operations be performed in Tensorflow using Python?


Unicode operations can be performed by first fetching the length of the encoded strings with tf.strings.length, whose unit parameter defaults to 'BYTE' but can be set to other values such as 'UTF8_CHAR'. The tf.strings.unicode_encode operation converts a vector of code points into an encoded string scalar, while tf.strings.unicode_decode does the reverse and determines the Unicode code points in every encoded string.
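
The sketch below (a minimal illustration, assuming TensorFlow 2.x is available and imported as tf) round-trips between a vector of code points and a UTF-8 encoded string scalar using tf.strings.unicode_encode and tf.strings.unicode_decode.

import tensorflow as tf

# A vector of Unicode code points for the string 'héllo'
code_points = tf.constant([104, 233, 108, 108, 111], dtype=tf.int32)

# Encode the code points into a single UTF-8 string scalar
encoded = tf.strings.unicode_encode(code_points, output_encoding='UTF-8')
print(encoded.numpy())   # b'h\xc3\xa9llo'

# Decode the string scalar back into its Unicode code points
decoded = tf.strings.unicode_decode(encoded, input_encoding='UTF-8')
print(decoded.numpy())   # [104 233 108 108 111]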

Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
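
As a brief sketch (again assuming TensorFlow 2.x), a Unicode string can be represented in TensorFlow either as a string scalar holding the UTF-8 encoded bytes or as a vector of integer code points.

import tensorflow as tf

# Representation 1: a string scalar containing the UTF-8 encoded bytes
text_utf8 = tf.constant(u'语言处理')
print(text_utf8.numpy())    # b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'

# Representation 2: a vector of Unicode code points
text_chars = tf.constant([ord(char) for char in u'语言处理'])
print(text_chars.numpy())   # [35821 35328 22788 29702]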

Let us understand how to represent Unicode strings using Python and TensorFlow, and how to manipulate them using Unicode equivalents of the standard string ops. One common step is to separate Unicode strings into tokens based on script detection, as sketched below.
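
A minimal sketch of that step (assuming TensorFlow 2.x): tf.strings.unicode_script maps every code point to an ICU script code, which can then be used to split a string wherever the script changes.

import tensorflow as tf

# Decode a mixed-script string into its Unicode code points
sentence = tf.constant(u'Hello, 世界')
code_points = tf.strings.unicode_decode(sentence, input_encoding='UTF-8')

# Map each code point to its ICU script code
# (25 = Latin, 17 = Han, 0 = Common, e.g. punctuation and spaces)
scripts = tf.strings.unicode_script(code_points)
print(scripts.numpy())   # [25 25 25 25 25  0  0 17 17]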

We are using Google Colaboratory to run the below code. Google Colab, or Colaboratory, helps run Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

print("The final character takes about 4 bytes in UTF-8 encoding")
thanks = u'Hello 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

The final character takes about 4 bytes in UTF-8 encoding
10 bytes; 7 UTF-8 characters

Explanation

  • The tf.strings.length operation has a unit parameter that indicates how lengths should be computed.
  • unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR".
  • This is done to find the number of Unicode code points in every encoded string, as the sketch below demonstrates.
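
To make the last point concrete, the short sketch below (assuming the import and the thanks variable from the code above) decodes the encoded string into its individual code points; there are 7 of them, matching the UTF8_CHAR count.

# Decode the encoded string back into its Unicode code points
code_points = tf.strings.unicode_decode(thanks, input_encoding='UTF-8')
print(code_points.numpy())           # [ 72 101 108 108 111  32 128522]
print(tf.size(code_points).numpy())  # 7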
