How can Unicode strings be represented and manipulated in TensorFlow?

A Unicode string is a sequence of characters, each identified by a standardized code point. TensorFlow can represent such strings in several ways: as UTF-8 encoded scalars, as UTF-16 encoded scalars, or as vectors of Unicode code points.

Unicode Representation in TensorFlow

Unicode is the standard character set used to represent characters from almost all languages. Each character is assigned a unique integer code point between 0 and 0x10FFFF. TensorFlow handles Unicode strings through its tf.string dtype, which stores byte strings and treats each one as an atomic unit. TensorFlow does not interpret the bytes itself, so operations that work on characters must be told which encoding the bytes use.

Creating Unicode Constants

You can create Unicode string constants with tf.constant(). A Python Unicode string passed to it is stored as UTF-8 encoded bytes by default:

import tensorflow as tf

# Define a Unicode constant with emoji
unicode_constant = tf.constant(u"Thanks 😊")
print("Unicode constant:", unicode_constant)

# Create tensor with multiple Unicode strings
unicode_tensor = tf.constant([u"You are", u"welcome!"])
print("Tensor shape:", unicode_tensor.shape)

Output:

Unicode constant: tf.Tensor(b'Thanks \xf0\x9f\x98\x8a', shape=(), dtype=string)
Tensor shape: (2,)
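
The b'...' value in the printed tensor is the raw UTF-8 byte string. As a minimal sketch, assuming TensorFlow 2.x with eager execution (the names raw_bytes and decoded_text are illustrative and not part of the original example), the bytes can be turned back into a Python string like this:

```python
import tensorflow as tf

# Store a Unicode string; TensorFlow keeps it as UTF-8 encoded bytes.
unicode_constant = tf.constant("Thanks 😊")

# In eager mode, .numpy() on a scalar string tensor returns a Python bytes object.
raw_bytes = unicode_constant.numpy()

# Decode the bytes back into a Python str for display.
decoded_text = raw_bytes.decode("utf-8")
print(decoded_text)
```

Decoding is only needed when the text leaves TensorFlow, for example for logging or display.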

UTF-8 Encoded Representation

A Python Unicode string passed to tf.constant() is stored as a UTF-8 encoded byte string:

import tensorflow as tf

# UTF-8 encoded scalar representation
text_utf8 = tf.constant(u"语言处理")
print("UTF-8 encoded:", text_utf8)

Output:

UTF-8 encoded: tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
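
The scalar above holds 12 bytes but only four characters. The following sketch, assuming TensorFlow 2.x, uses tf.strings.length to show the difference between counting bytes and counting UTF-8 characters (byte_length and char_length are illustrative names):

```python
import tensorflow as tf

text_utf8 = tf.constant("语言处理")

# The default unit is BYTE, which counts the UTF-8 bytes in the string.
byte_length = tf.strings.length(text_utf8)

# UTF8_CHAR counts Unicode code points instead of bytes.
char_length = tf.strings.length(text_utf8, unit="UTF8_CHAR")

print("Bytes:", byte_length.numpy())
print("Characters:", char_length.numpy())
```

Each of these four characters takes three bytes in UTF-8, which is why the two counts differ.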

UTF-16 Encoded Representation

You can also store a Unicode string as a UTF-16 encoded scalar by encoding it with Python's encode() method before passing it to tf.constant():

import tensorflow as tf

# UTF-16 encoded scalar representation
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print("UTF-16 encoded:", text_utf16be)

Output:

UTF-16 encoded: tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
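
Encoding with Python happens before the data reaches TensorFlow. When the string is already a tensor, tf.strings.unicode_transcode can convert between encodings inside TensorFlow. This is a minimal sketch, assuming TensorFlow 2.x; the variable names are illustrative:

```python
import tensorflow as tf

# Start from the default UTF-8 encoded scalar.
text_utf8 = tf.constant("语言处理")

# Re-encode the same characters as big-endian UTF-16.
text_utf16be = tf.strings.unicode_transcode(
    text_utf8, input_encoding="UTF-8", output_encoding="UTF-16-BE"
)
print("Transcoded:", text_utf16be)
```

The result should contain the same bytes as the Python encode("UTF-16-BE") call shown above.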

Unicode Code Points Vector

Another approach is to represent a Unicode string as a vector of integer code points, one entry per character:

import tensorflow as tf

# Vector of Unicode code points
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print("Code points:", text_chars)

Output:

Code points: tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
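
The example above builds the vector with Python's ord(). As a sketch of the equivalent step inside TensorFlow, assuming TensorFlow 2.x, tf.strings.unicode_decode converts an encoded scalar to code points and tf.strings.unicode_encode converts it back (code_points and round_trip are illustrative names):

```python
import tensorflow as tf

text_utf8 = tf.constant("语言处理")

# Encoded string scalar -> int32 vector of code points.
code_points = tf.strings.unicode_decode(text_utf8, input_encoding="UTF-8")

# int32 vector of code points -> encoded string scalar.
round_trip = tf.strings.unicode_encode(code_points, output_encoding="UTF-8")

print("Code points:", code_points.numpy())
print("Round trip matches:", round_trip.numpy() == text_utf8.numpy())
```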

Unicode Representation Methods

Method               Data Type   Description
UTF-8 Scalar         tf.string   Default encoding; each code point takes 1 to 4 bytes
UTF-16 Scalar        tf.string   Alternative encoding; each code point takes one or two 16-bit units
Code Points Vector   tf.int32    Each position contains a single code point

Key Points

  • TensorFlow's tf.string dtype can hold byte strings of different lengths
  • Unicode strings are UTF-8 encoded by default
  • The length of each string is not part of the tensor's shape
  • Python 2 needs the "u" prefix for Unicode string literals; in Python 3 all string literals are Unicode, so the prefix is optional
  • Two standard representations: string scalar (encoded) and int32 vector (code points)
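
To illustrate the points about shape and string length, here is a small sketch, assuming TensorFlow 2.x. It decodes a batch of two strings of different lengths; because the rows have different numbers of characters, tf.strings.unicode_decode returns a tf.RaggedTensor (the batch contents and variable names are illustrative):

```python
import tensorflow as tf

# Two strings of different lengths in one tensor.
batch = tf.constant(["hi", "语言处理"])

# The shape only counts the strings, not their lengths.
print("Batch shape:", batch.shape)

# Decoding a batch yields one row of code points per string.
batch_code_points = tf.strings.unicode_decode(batch, input_encoding="UTF-8")
print("Code points per string:", batch_code_points.to_list())
```

The ragged result keeps each string's characters in its own row without padding.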

Conclusion

TensorFlow supports Unicode string manipulation through UTF-8/UTF-16 encoded scalars and code point vectors. Use UTF-8 scalars for most text processing tasks, and code point vectors when you need character-level operations.
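
As one possible illustration of character-level operations, the sketch below, assuming TensorFlow 2.x, splits a UTF-8 scalar into single-character strings with tf.strings.unicode_split and takes a character-based substring with tf.strings.substr (chars and first_two are illustrative names):

```python
import tensorflow as tf

text = tf.constant("语言处理")

# Split the scalar into a vector of one-character UTF-8 strings.
chars = tf.strings.unicode_split(text, input_encoding="UTF-8")

# Take the first two characters; unit="UTF8_CHAR" makes pos and len count characters, not bytes.
first_two = tf.strings.substr(text, pos=0, len=2, unit="UTF8_CHAR")

print([c.decode("utf-8") for c in chars.numpy()])
print(first_two.numpy().decode("utf-8"))
```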

Updated on: 2026-03-25T16:05:02+05:30
