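Python Unicode strings passed to TensorFlow are UTF-8 encoded by default. A Unicode string can be represented as a UTF-8 encoded scalar using the ‘constant’ method in the TensorFlow module. It can be represented as a UTF-16 encoded scalar by first calling Python's string ‘encode’ method and passing the resulting bytes to the ‘constant’ method. It can also be represented as a vector of Unicode code points, obtained with Python's ‘ord’ function.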
Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
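To make the idea of a code point concrete, here is a minimal sketch that uses only Python built-ins (‘ord’, ‘chr’ and ‘str.encode’) rather than TensorFlow. The sample string and the variable names are chosen here purely for illustration.

```python
# Illustrative sketch in plain Python; the variable names are assumptions
# made for this example and do not come from the TensorFlow tutorial.
sample = "语言处理"

# ord() maps each character to its Unicode code point (an integer).
code_points = [ord(ch) for ch in sample]
print(code_points)

# Every code point lies in the range 0 .. 0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)
print(in_range)

# chr() is the inverse mapping, so the round trip restores the string.
restored = "".join(chr(cp) for cp in code_points)
print(restored == sample)

# The same four characters occupy a different number of bytes per encoding.
utf8_len = len(sample.encode("utf-8"))
utf16_len = len(sample.encode("utf-16-be"))
print(utf8_len, utf16_len)
```

The code points stay the same whichever encoding is chosen; only the byte representation changes.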
Let us understand how to represent Unicode strings in TensorFlow using Python, and how to manipulate them using the Unicode equivalents of standard string ops. These Unicode equivalents can also be used to separate Unicode strings into tokens based on script detection.
We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
import tensorflow as tf

print("A constant is defined")
tf.constant(u"Thanks 😊")
print("The shape of the tensor is")
tf.constant([u"You are", u"welcome!"]).shape

print("Unicode string represented as UTF-8 encoded scalar")
text_utf8 = tf.constant(u"语言处理")
print(text_utf8)

print("Unicode string represented as UTF-16 encoded scalar")
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print(text_utf16be)

print("Unicode string represented as a vector of Unicode code points")
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print(text_chars)
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
A constant is defined
The shape of the tensor is
Unicode string represented as UTF-8 encoded scalar
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Unicode string represented as UTF-16 encoded scalar
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode string represented as a vector of Unicode code points
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
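As a sanity check on the output above, the following sketch decodes the printed byte strings and code points back into text using plain Python, without TensorFlow. The values are copied from the output, and the variable names are assumptions made for this example.

```python
# Values copied from the tensors printed above.
utf8_bytes = b"\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86"
utf16be_bytes = b"\x8b\xed\x8a\x00Y\x04t\x06"
code_points = [35821, 35328, 22788, 29702]

# Decode each representation back to a Python string.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16be_bytes.decode("utf-16-be")
from_points = "".join(chr(cp) for cp in code_points)

# All three representations describe the same four characters.
all_equal = from_utf8 == from_utf16 == from_points
print(from_utf8, all_equal)
```

All three decode to the same string, which confirms that the tensors differ only in how the text is stored, not in the text itself.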