Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How can Unicode strings be represented and manipulated in Tensorflow?
Unicode strings are sequences of characters from different languages encoded using standardized code points. TensorFlow provides several ways to represent and manipulate Unicode strings, including UTF-8 encoded scalars, UTF-16 encoded scalars, and vectors of Unicode code points.
Unicode Representation in TensorFlow
Unicode is the standard encoding system used to represent characters from almost all languages. Each character is encoded with a unique integer code point between 0 and 0x10FFFF. TensorFlow handles Unicode strings through its tf.string dtype, which stores byte strings and treats them as atomic units.
Creating Unicode Constants
You can create Unicode string constants using tf.constant() with UTF-8 encoding by default :
import tensorflow as tf
# Define a Unicode constant with emoji
unicode_constant = tf.constant(u"Thanks ?")
print("Unicode constant:", unicode_constant)
# Create tensor with multiple Unicode strings
unicode_tensor = tf.constant([u"You are", u"welcome!"])
print("Tensor shape:", unicode_tensor.shape)
Unicode constant: tf.Tensor(b'Thanks \xf0\x9f\x98\x8a', shape=(), dtype=string) Tensor shape: (2,)
UTF-8 Encoded Representation
Unicode strings are UTF-8 encoded by default in TensorFlow :
import tensorflow as tf
# UTF-8 encoded scalar representation
text_utf8 = tf.constant(u"????")
print("UTF-8 encoded:", text_utf8)
UTF-8 encoded: tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
UTF-16 Encoded Representation
You can also represent Unicode strings as UTF-16 encoded scalars using Python's encode() method :
import tensorflow as tf
# UTF-16 encoded scalar representation
text_utf16be = tf.constant(u"????".encode("UTF-16-BE"))
print("UTF-16 encoded:", text_utf16be)
UTF-16 encoded: tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode Code Points Vector
Another approach is to represent Unicode strings as vectors of integer code points :
import tensorflow as tf
# Vector of Unicode code points
text_chars = tf.constant([ord(char) for char in u"????"])
print("Code points:", text_chars)
Code points: tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
Unicode Representation Methods
| Method | Data Type | Description |
|---|---|---|
| UTF-8 Scalar | tf.string |
Default encoding, sequence of code points |
| UTF-16 Scalar | tf.string |
Alternative encoding using 16-bit units |
| Code Points Vector | tf.int32 |
Each position contains a single code point |
Key Points
- TensorFlow's
tf.stringdtype can hold byte strings of different lengths - Unicode strings are UTF-8 encoded by default
- String length is not included in tensor dimensions
- Python 2 uses "u" prefix for Unicode strings, Python 3 uses Unicode by default
- Two standard representations: string scalar (encoded) and int32 vector (code points)
Conclusion
TensorFlow supports Unicode string manipulation through UTF-8/UTF-16 encoded scalars and code point vectors. Use UTF-8 scalars for most text processing tasks, and code point vectors when you need character-level operations.
