What are Unicode scripts with respect to TensorFlow and Python?

A Unicode script is a collection of Unicode code points that belong to one writing system, such as Latin, Cyrillic, or Han. A script identifies a writing system rather than a language, since many languages can share one script. TensorFlow provides the tf.strings.unicode_script function, which takes int32 Unicode code points and returns int32 script codes that correspond to International Components for Unicode (ICU) UScriptCode values.

Understanding Unicode Scripts

Every Unicode code point is assigned exactly one script. For example:

  • Chinese characters belong to the Han script (code 17)
  • Cyrillic characters belong to the Cyrillic script (code 8)
  • Latin characters belong to the Latin script (code 25)

Basic Script Detection

Here's how to detect scripts for individual Unicode code points −

import tensorflow as tf

# 33464 is the code point of '芸' and 1041 is the code point of 'Б'
print("The below represent '芸' and 'Б' respectively")
uscript = tf.strings.unicode_script([33464, 1041])
print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

The output is −

The below represent '芸' and 'Б' respectively
[17  8]

Multidimensional Script Detection

The function also accepts multidimensional tensors and ragged tensors, and returns script codes in the same shape as the input −

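import tensorflow as tf

# Create sample Unicode strings
unicode_strings = [
    "Hello",
    "TensorFlow 2.0 ? is great!",
    "हिंदी",  # Devanagari sample text (illustrative stand-in)
    ""
]

# Decode each string into a ragged row of Unicode code points
batch_chars_ragged = tf.strings.unicode_decode(unicode_strings, 'UTF-8')
print("Applying to multidimensional strings")
# Look up the script code of every code point; the ragged shape is preserved
scripts = tf.strings.unicode_script(batch_chars_ragged)
print(scripts)

The output is −

Applying to multidimensional strings
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 0, 0, 0, 0, 0, 0, 0, 25, 25, 0, 25, 25, 25, 25, 25, 0], [10, 10, 10, 10, 10], []]>

Digits, spaces, and punctuation belong to the Common script (0), Latin letters map to 25, Devanagari characters map to 10, and the empty string produces an empty row.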
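
Per-character script codes can be used to split text into runs that share one script, which is a rough way to find boundaries in mixed-script text. The following is a minimal sketch in plain Python rather than a TensorFlow API: the helper name script_runs is hypothetical, and the codes list is written out by hand on the assumption that it matches what tf.strings.unicode_script returns for these characters (Latin 25, Common 0, Cyrillic 8, Han 17).

```python
from itertools import groupby


def script_runs(text, codes):
    """Group the characters of `text` into runs that share a script code.

    `codes` is assumed to hold one script code per character of `text`,
    for example a row taken from the output of tf.strings.unicode_script.
    """
    if len(text) != len(codes):
        raise ValueError("text and codes must have the same length")
    runs = []
    # groupby merges consecutive (code, char) pairs that have the same code
    for code, group in groupby(zip(codes, text), key=lambda pair: pair[0]):
        runs.append((code, "".join(ch for _, ch in group)))
    return runs


text = "ab Б芸"
codes = [25, 25, 0, 8, 17]  # hand-written script codes, one per character
runs = script_runs(text, codes)
print(runs)  # [(25, 'ab'), (0, ' '), (8, 'Б'), (17, '芸')]
```

In a real pipeline the codes could come from a single row of the ragged result, converted to a Python list first.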

Script Code Mapping

Common script codes include:

Script        Code   Example characters
Common        0      Spaces, digits, punctuation
Devanagari    10     हिंदी
Cyrillic      8      Б
Han           17     芸
Latin         25     A, B, C
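
Raw integer codes are hard to read in logs, so it can help to translate them into script names. The sketch below assumes a small hand-written lookup covering only a few ICU UScriptCode values; the names SCRIPT_NAMES and script_names are hypothetical and not part of TensorFlow.

```python
# Hypothetical lookup table covering only a handful of ICU UScriptCode values
SCRIPT_NAMES = {
    0: "Common",       # USCRIPT_COMMON
    8: "Cyrillic",     # USCRIPT_CYRILLIC
    10: "Devanagari",  # USCRIPT_DEVANAGARI
    17: "Han",         # USCRIPT_HAN
    25: "Latin",       # USCRIPT_LATIN
}


def script_names(codes):
    """Translate script codes into names; unlisted codes are shown as 'code N'."""
    return [SCRIPT_NAMES.get(int(c), f"code {int(c)}") for c in codes]


names = script_names([17, 8])
print(names)  # ['Han', 'Cyrillic']
```

The same helper should accept the values of a dense result such as uscript.numpy(), since each element is converted with int() before the lookup.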

Conclusion

Unicode script detection in TensorFlow helps identify the writing system of characters using tf.strings.unicode_script. This is essential for text processing tasks involving multiple languages and scripts.

Updated on: 2026-03-25T16:07:08+05:30
