How can a Unicode string be split, and byte offsets be specified, with TensorFlow and Python?


A Unicode string can be split into individual characters with the ‘unicode_split’ method, and the byte offset of each character can be obtained with the ‘unicode_decode_with_offsets’ method. Both methods are part of the ‘strings’ module of ‘tensorflow’ (tf.strings).


TensorFlow can represent Unicode strings and manipulate them with Unicode equivalents of the standard string ops. The same ops can be used to separate Unicode strings into tokens based on script detection.

We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and offers free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

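import tensorflow as tf

# Sample string (an assumption: the snippet does not show how `thanks` is defined)
thanks = u"Thanks 😊".encode('UTF-8')

print("Split unicode strings")
# Returns one byte string per character; wrap the call in print() to display it
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))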

Code credit:


Output:

Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
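
The output above does not include the result of tf.strings.unicode_split, because that value is never printed. The following is a minimal sketch of how it could be displayed. The sample string "Thanks 😊" and the variable names are assumptions chosen for illustration, not values confirmed by the code above.

```python
import tensorflow as tf

# Hypothetical sample string, assumed here for illustration only.
thanks = "Thanks 😊".encode("UTF-8")

# unicode_split returns a 1-D string tensor holding one UTF-8 byte string per character.
chars = tf.strings.unicode_split(thanks, "UTF-8").numpy()
print(chars)

# Decode each element back to a Python str to make the result easier to read.
decoded = [c.decode("UTF-8") for c in chars]
print(decoded)
```

Each ASCII letter comes back as a single byte, while the emoji comes back as one four-byte element, so the split follows character boundaries rather than byte boundaries.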


  • The tf.strings.unicode_split operation splits each Unicode string into substrings of individual characters.
  • To align the character tensor generated by tf.strings.unicode_decode with the original string, it is necessary to know the byte offset at which each character begins.
  • The method tf.strings.unicode_decode_with_offsets is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.
  • In the output above, each emoji occupies four bytes in UTF-8, so the characters start at byte offsets 0, 4 and 8.
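
To see where these offsets come from, the same numbers can be reproduced without TensorFlow. The sketch below is plain Python and is only an illustration of the idea, not how TensorFlow implements it. The helper name utf8_offsets is hypothetical.

```python
def utf8_offsets(text):
    """Return (codepoint, byte_offset) pairs for each character of `text`."""
    pairs = []
    offset = 0
    for ch in text:
        pairs.append((ord(ch), offset))
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))
    return pairs


text = "🎈🎉🎊"
encoded = text.encode("utf-8")
pairs = utf8_offsets(text)
for codepoint, offset in pairs:
    print("At byte offset {}: codepoint {}".format(offset, codepoint))

# Use consecutive start offsets to slice each character's bytes out of the
# original byte string, which is the alignment the offsets make possible.
starts = [offset for _, offset in pairs]
ends = starts[1:] + [len(encoded)]
chunks = [encoded[s:e] for s, e in zip(starts, ends)]
print([c.decode("utf-8") for c in chunks])
```

The start offsets alone are enough to recover each character from the encoded bytes, since each character ends where the next one begins.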
Published on 11-Feb-2021 07:31:56