How can Unicode strings be represented and manipulated in Tensorflow?

PythonServer Side ProgrammingProgrammingTensorflow

Unicode strings are utf-8 encoded by default. Unicode string can be represented as UTF-8 encoded scalar values using the ‘constant’ method in Tensorflow module. Unicode strings can be represented as UTF-16 encoded scalar using the ‘encode’ method present in Tensorflow module.

Read More: What is TensorFlow and how Keras work with TensorFlow to create Neural Networks?

Models which process natural language handle different languages that have different character sets. Unicode is considered as the standard encoding system which is used to represent character from almost all the languages. Every character is encoded with the help of a unique integer code point that is between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code values.

Let us understand how to represent Unicode strings using Python, and manipulate those using Unicode equivalents. First, we separate the Unicode strings into tokens based on script detection with the help of the Unicode equivalents of standard string ops.

We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.

import tensorflow as tf
print("A constant is defined")
tf.constant(u"Thanks 😊")
print("The shape of the tensor is")
tf.constant([u"You are", u"welcome!"]).shape
print("Unicode string represented as UTF-8 encoded scalar")
text_utf8 = tf.constant(u"语言处理")
print(text_utf8)
print("Unicode string represented as UTF-16 encoded scalar")
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print(text_utf16be)
print("Unicode string represented as a vector of Unicode code points")
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print(text_chars)

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

A constant is defined
The shape of the tensor is
Unicode string represented as UTF-8 encoded scalar
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Unicode string represented as UTF-16 encoded scalar
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode string represented as a vector of Unicode code points
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)

Explanation

  • The TensorFlow tf.string is a basic dtype.
  • It allows the user to build tensors of byte strings.
  • Unicode strings are utf-8 encoded by default.
  • A tf.string tensor has the ability to hold byte strings of different lengths since byte strings are treated as atomic units.
  • The string length is not included in tensor dimensions.
  • When Python is used to construct strings, unicode handling changes betweeen v2 and v3. In v2, unicode strings are indicated by the "u" prefix.
  • In v3, strings are unicode-encoded by default.
  • There are two standard ways of representing Unicode string in TensorFlow:
  • string scalar: A sequence of code points are encoded with a known character encoding.
  • int32 vector: A method where every position contains a single code point.
raja
Published on 11-Feb-2021 06:48:13
Advertisements