Every Unicode code point belongs to a single collection of code points known as a script. A character's script helps determine which language the character might be in, although it does not identify the language on its own, since many languages share a script. TensorFlow provides the 'tf.strings.unicode_script' method to find which script a given code point uses. The script codes are int32 values that correspond to International Components for Unicode (ICU) UScriptCode values.
We will now see how to represent Unicode strings in Python and manipulate them using the Unicode equivalents of standard string ops. One use of script detection is to separate Unicode strings into tokens: a change of script between adjacent characters marks a likely token boundary.
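The idea of splitting on script changes can be sketched in plain Python. This is a minimal illustration, not the TensorFlow implementation: the helper name split_by_script is hypothetical, and the script codes are hard-coded as they would be reported by tf.strings.unicode_script (25 for Latin, 17 for Han, 8 for Cyrillic) so that the sketch runs without TensorFlow.

```python
from itertools import groupby

# Assumed input: characters paired with their script codes, hard-coded here
# as tf.strings.unicode_script would report them
# (25 = Latin, 17 = Han, 8 = Cyrillic).
chars = ["a", "b", "c", "芸", "芸", "Б"]
scripts = [25, 25, 25, 17, 17, 8]


def split_by_script(chars, scripts):
    """Group consecutive characters that share a script code into tokens."""
    pairs = zip(chars, scripts)
    return [
        "".join(ch for ch, _ in group)
        for _, group in groupby(pairs, key=lambda pair: pair[1])
    ]


tokens = split_by_script(chars, scripts)
print(tokens)  # ['abc', '芸芸', 'Б']
```

A real pipeline would also need to decide how to treat characters with script code 0 (common), such as spaces and punctuation, which this sketch simply groups into tokens of their own.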
We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
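import tensorflow as tf

print("The below represent '芸' and 'Б' respectively")
uscript = tf.strings.unicode_script([33464, 1041])
print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

# batch_chars_ragged is assumed to be a tf.RaggedTensor of code points
# produced earlier, for example with tf.strings.unicode_decode
print("Applying to multidimensional strings")
print(tf.strings.unicode_script(batch_chars_ragged))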
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
Output:
The below represent '芸' and 'Б' respectively
[17 8]
Applying to multidimensional strings
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>
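The integers in this output are easier to read once they are mapped back to ICU UScriptCode names. The sketch below is a hedged illustration in plain Python: the lookup table covers only the four codes that appear in this article (0, 8, 17, 25), and the names SCRIPT_NAMES and name_scripts are hypothetical helpers, not part of TensorFlow.

```python
# Partial lookup table: only the ICU UScriptCode values seen in this article.
SCRIPT_NAMES = {
    0: "USCRIPT_COMMON",    # spaces, punctuation, emoji
    8: "USCRIPT_CYRILLIC",
    17: "USCRIPT_HAN",
    25: "USCRIPT_LATIN",
}


def name_scripts(rows):
    """Replace each script code in a nested list with its name, if known."""
    return [
        [SCRIPT_NAMES.get(code, f"UNKNOWN({code})") for code in row]
        for row in rows
    ]


# Nested lists standing in for script codes returned by TensorFlow.
rows = [[17, 8], [25, 25, 0, 25], [0]]
for row in name_scripts(rows):
    print(row)
```

To apply this to a real result, a tf.RaggedTensor can first be converted to nested Python lists with its to_list() method. Read this way, the 0 entries in the second row of the output above are the spaces between words, and the single 0 in the last row is a character from the common script, such as an emoji.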