Article Categories

Selected Reading

How can Unicode strings be represented and manipulated in Tensorflow?

Python Server Side Programming Programming Tensorflow

Unicode strings are sequences of characters from different languages encoded using standardized code points. TensorFlow provides several ways to represent and manipulate Unicode strings, including UTF-8 encoded scalars, UTF-16 encoded scalars, and vectors of Unicode code points.

Unicode Representation in TensorFlow

Unicode is the standard encoding system used to represent characters from almost all languages. Each character is encoded with a unique integer code point between 0 and 0x10FFFF. TensorFlow handles Unicode strings through its tf.string dtype, which stores byte strings and treats them as atomic units.

Creating Unicode Constants

You can create Unicode string constants using tf.constant() with UTF-8 encoding by default :

import tensorflow as tf

# Define a Unicode constant with emoji
unicode_constant = tf.constant(u"Thanks ?")
print("Unicode constant:", unicode_constant)

# Create tensor with multiple Unicode strings
unicode_tensor = tf.constant([u"You are", u"welcome!"])
print("Tensor shape:", unicode_tensor.shape)

Unicode constant: tf.Tensor(b'Thanks \xf0\x9f\x98\x8a', shape=(), dtype=string)
Tensor shape: (2,)

UTF-8 Encoded Representation

Unicode strings are UTF-8 encoded by default in TensorFlow :

import tensorflow as tf

# UTF-8 encoded scalar representation
text_utf8 = tf.constant(u"????")
print("UTF-8 encoded:", text_utf8)

UTF-8 encoded: tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)

UTF-16 Encoded Representation

You can also represent Unicode strings as UTF-16 encoded scalars using Python's encode() method :

import tensorflow as tf

# UTF-16 encoded scalar representation
text_utf16be = tf.constant(u"????".encode("UTF-16-BE"))
print("UTF-16 encoded:", text_utf16be)

UTF-16 encoded: tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)

Unicode Code Points Vector

Another approach is to represent Unicode strings as vectors of integer code points :

import tensorflow as tf

# Vector of Unicode code points
text_chars = tf.constant([ord(char) for char in u"????"])
print("Code points:", text_chars)

Code points: tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)

Unicode Representation Methods

Method	Data Type	Description
UTF-8 Scalar	`tf.string`	Default encoding, sequence of code points
UTF-16 Scalar	`tf.string`	Alternative encoding using 16-bit units
Code Points Vector	`tf.int32`	Each position contains a single code point

Key Points

TensorFlow's tf.string dtype can hold byte strings of different lengths
Unicode strings are UTF-8 encoded by default
String length is not included in tensor dimensions
Python 2 uses "u" prefix for Unicode strings, Python 3 uses Unicode by default
Two standard representations: string scalar (encoded) and int32 vector (code points)

Conclusion

TensorFlow supports Unicode string manipulation through UTF-8/UTF-16 encoded scalars and code point vectors. Use UTF-8 scalars for most text processing tasks, and code point vectors when you need character-level operations.

AmitDiwan

Updated on: 2026-03-25T16:05:02+05:30

378 Views

Previous Next