What are Unicode scripts with respect to TensorFlow and Python?
Unicode scripts are collections of Unicode code points grouped by the writing system they belong to, such as Latin, Cyrillic, or Han. A script is not the same as a language, since many languages can share one script. TensorFlow provides the tf.strings.unicode_script function to identify the script of any Unicode code point. It returns int32 values that correspond to International Components for Unicode (ICU) UScriptCode values.
Understanding Unicode Scripts
Every Unicode character belongs to exactly one script collection. For example:
- Chinese characters belong to the Han script (code 17)
- Cyrillic characters belong to the Cyrillic script (code 8)
- Latin characters belong to the Latin script (code 25)
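The integer codes in the list above are easier to read once mapped back to names. A minimal sketch, assuming a hand-written lookup table: `SCRIPT_NAMES` and `script_name` are hypothetical helpers, not part of TensorFlow, and the table covers only a few values from ICU's UScriptCode enumeration.

```python
# Hypothetical lookup table: only a few ICU UScriptCode values are listed here.
SCRIPT_NAMES = {
    0: "Common",
    8: "Cyrillic",
    17: "Han",
    25: "Latin",
}


def script_name(code: int) -> str:
    """Return a readable name for a script code, or a placeholder if unlisted."""
    return SCRIPT_NAMES.get(code, f"Unknown({code})")


print([script_name(c) for c in [17, 8, 25, 99]])
# ['Han', 'Cyrillic', 'Latin', 'Unknown(99)']
```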
Basic Script Detection
Here's how to detect scripts for individual Unicode code points:
import tensorflow as tf
print("The below represent '芸' and 'Б' respectively")
# 33464 is U+82B8 ('芸') and 1041 is U+0411 ('Б')
uscript = tf.strings.unicode_script([33464, 1041])
print(uscript.numpy()) # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
Output:
The below represent '芸' and 'Б' respectively
[17  8]
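The function takes integer code points rather than characters. As a sketch of where values such as 33464 and 1041 come from, Python's built-in `ord` and `chr` convert between a character and its code point. The helper name `to_code_points` is an assumption made for illustration.

```python
from typing import List


def to_code_points(text: str) -> List[int]:
    """Convert a string to a list of Unicode code points."""
    return [ord(ch) for ch in text]


code_points = to_code_points("芸Б")
print(code_points)                       # [33464, 1041]
print([chr(cp) for cp in code_points])   # ['芸', 'Б']
```

A list built this way can be passed directly to tf.strings.unicode_script.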
Multidimensional Script Detection
The function also works with multidimensional tensors and ragged tensors:
import tensorflow as tf
# Create sample Unicode strings
unicode_strings = [
    "Hello",
    "TensorFlow 2.0 ? is great!",
    "हिंदी",  # a five-code-point Devanagari word
    "",
]
# Convert to Unicode code points
batch_chars_ragged = tf.strings.unicode_decode(unicode_strings, 'UTF-8')
print("Applying to multidimensional strings")
scripts = tf.strings.unicode_script(batch_chars_ragged)
print(scripts)
Output:
Applying to multidimensional strings
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 0, 0, 0, 0, 0, 0, 0, 25, 25, 0, 25, 25, 25, 25, 25, 0], [10, 10, 10, 10, 10], []]>
Spaces, digits, and punctuation are shared across writing systems, so they are reported as Common (0) rather than Latin (25). The empty string produces an empty row.
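One use of per-character script codes is finding where the script changes within a string. The sketch below is plain Python rather than a TensorFlow op. It assumes the codes for one string have already been extracted as a Python list (for example with `.to_list()` on the ragged result), and `script_runs` is a hypothetical helper name.

```python
from itertools import groupby
from typing import List, Tuple


def script_runs(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse per-character script codes into (script_code, run_length) pairs."""
    return [(code, len(list(group))) for code, group in groupby(codes)]


# Script codes for a string such as "TF 芸": Latin, Latin, Common (space), Han.
codes = [25, 25, 0, 17]
print(script_runs(codes))  # [(25, 2), (0, 1), (17, 1)]
```

Each boundary between runs marks a point where the text switches script, which can serve as a rough segmentation signal for mixed-script input.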
Script Code Mapping
Common script codes include:
| Script | Code | Example Characters |
|---|---|---|
| Common | 0 | Spaces, punctuation |
| Cyrillic | 8 | Б |
| Devanagari | 10 | हिंदी |
| Han | 17 | 芸 |
| Latin | 25 | A, B, C |
Conclusion
Unicode script detection in TensorFlow helps identify the writing system of characters using tf.strings.unicode_script. This is essential for text processing tasks involving multiple languages and scripts.
