What are Unicode scripts with respect to TensorFlow and Python?


Every Unicode code point belongs to exactly one collection of code points, known as a script. A character's script helps determine which language the character might belong to; for example, a character in the Cyrillic script suggests that the text is likely in a language such as Russian or Ukrainian. TensorFlow provides the 'tf.strings.unicode_script' operation, which returns the script used by a given code point. The script codes are int32 values that correspond to International Components for Unicode (ICU) UScriptCode values.


We will now see how to find the script of Unicode code points in Python, using TensorFlow's Unicode equivalents of the standard string ops. Script detection can then be used to separate Unicode strings into tokens at script boundaries.

We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

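import tensorflow as tf
# Sample strings from the credited tutorial, decoded to code points (one row per string)
batch_utf8 = [s.encode('UTF-8') for s in ['hÃllo', 'What is the weather tomorrow', 'Göödnight', '😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')
print("The below represent '芸' and 'Б' respectively")
uscript = tf.strings.unicode_script([33464, 1041])
print(uscript.numpy())   # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
print("Applying to multidimensional strings")
print(tf.strings.unicode_script(batch_chars_ragged))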

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

The below represent '芸' and 'Б' respectively
[17   8]
Applying to multidimensional strings
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>
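
The integers in the output are easier to read once they are translated into script names. The following is a minimal sketch in plain Python, assuming a small hand-copied subset of the ICU UScriptCode values that appear above (0, 8, 17 and 25). The dictionary SCRIPT_NAMES and the helper script_name are hypothetical names introduced for illustration and are not part of TensorFlow.

```python
# Hand-copied subset of ICU UScriptCode values (assumption: only the codes
# seen in the output above are listed; the full ICU enum has many more).
SCRIPT_NAMES = {
    0: "COMMON",    # shared characters such as spaces, punctuation and emoji
    8: "CYRILLIC",
    17: "HAN",
    25: "LATIN",
}


def script_name(code):
    """Return a readable name for a script code, or a placeholder if unlisted."""
    code = int(code)
    return SCRIPT_NAMES.get(code, f"UNKNOWN({code})")


codes = [17, 8]  # the values printed above for '芸' and 'Б'
names = [script_name(c) for c in codes]
print(names)  # ['HAN', 'CYRILLIC']
```

The same helper could be applied to each value in a row of the ragged output, for example after converting the tensor to nested Python lists.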

Explanation

  • Every Unicode code point belongs to exactly one collection of code points, known as a script.
  • A character's script helps determine which language the character could belong to.
  • TensorFlow provides the tf.strings.unicode_script operation to find out which script a given code point uses.
  • The script codes are int32 values that map to International Components for Unicode (ICU) UScriptCode values.
  • The tf.strings.unicode_script operation can also be applied to multidimensional tf.Tensors or tf.RaggedTensors of code points.
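
Script codes can also be used to split a string into tokens at the points where the script changes. The following is a rough sketch in plain Python rather than TensorFlow, so it can run without any dependencies. The function split_on_script_change is a hypothetical helper, the sentence is an illustrative choice, and the codes are produced by a stand-in rule (25 for letters, 0 for spaces) instead of a call to tf.strings.unicode_script.

```python
def split_on_script_change(text, script_codes):
    """Split text wherever a character's script code differs from the previous one."""
    if len(text) != len(script_codes):
        raise ValueError("expected one script code per character")
    tokens = []
    start = 0
    for i in range(1, len(text)):
        if script_codes[i] != script_codes[i - 1]:
            tokens.append(text[start:i])  # close the current token
            start = i                     # a new token begins here
    if text:
        tokens.append(text[start:])       # the final token
    return tokens


sentence = "What is the weather tomorrow"
# Stand-in for the real op (assumption): 25 = Latin for letters, 0 = Common for spaces.
codes = [0 if ch == " " else 25 for ch in sentence]
tokens = split_on_script_change(sentence, codes)
print(tokens)
# ['What', ' ', 'is', ' ', 'the', ' ', 'weather', ' ', 'tomorrow']
```

Because spaces fall under the Common script, each space becomes its own token here; a real tokenizer would typically decide separately how to treat such shared characters.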

Updated on: 20-Feb-2021
