What are Unicode scripts with respect to TensorFlow and Python?

A Unicode script is a collection of code points that share a writing system, such as Latin, Cyrillic, or Han. A character's script therefore indicates which writing system it belongs to. TensorFlow provides the tf.strings.unicode_script function to look up the script of any Unicode code point. It returns int32 values that correspond to International Components for Unicode (ICU) UScriptCode values.

Understanding Unicode Scripts

Every Unicode code point is assigned exactly one script. For example:

Chinese characters belong to the Han script (code 17)
Cyrillic characters belong to the Cyrillic script (code 8)
Latin characters belong to the Latin script (code 25)

Basic Script Detection

Here's how to detect scripts for individual Unicode code points −

import tensorflow as tf

print("The code points below represent '芸' and 'Б' respectively")
uscript = tf.strings.unicode_script([33464, 1041])
print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

The code points below represent '芸' and 'Б' respectively
[17  8]
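
The integers passed to tf.strings.unicode_script are ordinary Unicode code points, so Python's built-in ord and chr are enough to see where 33464 and 1041 come from. A minimal sketch using only the standard library (no TensorFlow required):

```python
# Map characters to their Unicode code points with the built-in ord().
chars = ["芸", "Б"]
code_points = [ord(c) for c in chars]
print(code_points)  # [33464, 1041]

# chr() reverses the mapping, turning code points back into characters.
round_trip = [chr(cp) for cp in code_points]
print(round_trip)  # ['芸', 'Б']
```

A list built this way can be passed directly to tf.strings.unicode_script in place of the hard-coded integers.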

Multidimensional Script Detection

The method also works with multidimensional tensors and ragged tensors −

import tensorflow as tf

# Create sample Unicode strings
unicode_strings = [
    "Hello",
    "TensorFlow 2.0 ? is great!",
    "हिंदी",  # Devanagari sample text (5 code points)
    ""
]

# Convert to Unicode code points
batch_chars_ragged = tf.strings.unicode_decode(unicode_strings, 'UTF-8')
print("Applying to multidimensional strings")
scripts = tf.strings.unicode_script(batch_chars_ragged)
print(scripts)

Applying to multidimensional strings
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 0, 0, 0, 0, 0, 0, 0, 25, 25, 0, 25, 25, 25, 25, 25, 0], [10, 10, 10, 10, 10], []]>
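
One use of per-character script codes is splitting text into runs that share a script. The sketch below is plain Python and assumes the codes have already been computed, for example by converting one row of the ragged result to a Python list. The helper name split_by_script is hypothetical and is not part of TensorFlow.

```python
from itertools import groupby

def split_by_script(text, codes):
    """Group consecutive characters that share a script code.

    `codes` must hold one script code per character of `text`.
    Returns a list of (script_code, substring) pairs.
    """
    if len(text) != len(codes):
        raise ValueError("need exactly one script code per character")
    runs = []
    pairs = zip(codes, text)
    # groupby merges neighbouring pairs whose script code is equal.
    for code, group in groupby(pairs, key=lambda pair: pair[0]):
        runs.append((code, "".join(ch for _, ch in group)))
    return runs

# Codes written by hand for illustration: 25 = Latin, 0 = Common.
text = "Hello, world"
codes = [25] * 5 + [0, 0] + [25] * 5
runs = split_by_script(text, codes)
print(runs)  # [(25, 'Hello'), (0, ', '), (25, 'world')]
```

Punctuation and spaces form their own runs here because they carry the Common code rather than the code of the surrounding letters.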

Script Code Mapping

Frequently encountered script codes include:

Script	Code	Example Characters
Common	0	Spaces, punctuation, digits
Cyrillic	8	Б
Devanagari	10	हिंदी
Han	17	芸
Latin	25	A, B, C
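
Because the function returns bare integers, a small lookup table can make results easier to read. The sketch below hard-codes only a handful of ICU UScriptCode values, and the names SCRIPT_NAMES, script_name, and summarize are hypothetical helpers rather than TensorFlow APIs.

```python
from collections import Counter

# A small subset of ICU UScriptCode values; extend as needed.
SCRIPT_NAMES = {
    0: "Common",
    8: "Cyrillic",
    10: "Devanagari",
    17: "Han",
    25: "Latin",
}

def script_name(code):
    """Return a readable name, or a placeholder for codes outside the subset."""
    return SCRIPT_NAMES.get(code, f"code {code}")

def summarize(codes):
    """Count how many code points fall into each named script."""
    return Counter(script_name(code) for code in codes)

# Codes written by hand for illustration: two Han, one Common, three Latin.
codes = [17, 17, 0, 25, 25, 25]
summary = summarize(codes)
print(summary["Latin"], summary["Han"], summary["Common"])  # 3 2 1
print(script_name(14))  # code 14
```

Codes missing from the dictionary are reported by number instead of raising an error, which keeps the helper usable on text in scripts the table does not cover.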

Conclusion

tf.strings.unicode_script identifies the writing system of each Unicode code point in a tensor, including multidimensional and ragged tensors. This is useful for text processing tasks that involve multiple languages and scripts.

AmitDiwan

Updated on: 2026-03-25T16:07:08+05:30
