Article Categories

Selected Reading

How can Unicode string be split, and byte offset be specified with Tensorflow & Python?

Python Server Side Programming Programming Tensorflow

Unicode strings can be split into individual characters, and byte offsets can be specified using TensorFlow's tf.strings.unicode_split and tf.strings.unicode_decode_with_offsets methods. These are essential for processing Unicode text in machine learning applications.

Splitting Unicode Strings

The tf.strings.unicode_split method splits Unicode strings into individual character tokens based on the specified encoding ?

import tensorflow as tf

# Create a Unicode string
thanks = "Thanks! ?"

print("Split unicode strings")
result = tf.strings.unicode_split(thanks, 'UTF-8')
print(result.numpy())

Split unicode strings
[b'T' b'h' b'a' b'n' b'k' b's' b'!' b' ' b'\xf0\x9f\x91\x8d']

Getting Byte Offsets for Characters

The tf.strings.unicode_decode_with_offsets method returns both Unicode codepoints and their byte offsets within the original string ?

import tensorflow as tf

# Unicode string with emoji characters
unicode_string = "???"

codepoints, offsets = tf.strings.unicode_decode_with_offsets(unicode_string, 'UTF-8')

print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))

Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882

How It Works

The tf.strings.unicode_split operation splits Unicode strings into substrings of individual characters
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, but also returns byte offset positions
Each emoji character takes 4 bytes in UTF-8 encoding, which is why the offsets are 0, 4, and 8
The codepoints represent the Unicode values for each character (127880 = ?, 127881 = ?, 127882 = ?)

Practical Example

Here's a complete example showing both methods working together ?

import tensorflow as tf

# Mixed Unicode string
text = "Hello ?? ?"

# Split into characters
split_chars = tf.strings.unicode_split(text, 'UTF-8')
print("Characters:", split_chars.numpy())

# Get codepoints and offsets
codepoints, offsets = tf.strings.unicode_decode_with_offsets(text, 'UTF-8')
print("\nCharacter analysis:")
for i, (codepoint, offset) in enumerate(zip(codepoints.numpy(), offsets.numpy())):
    char = chr(codepoint)
    print(f"Position {i}: '{char}' (codepoint: {codepoint}, byte offset: {offset})")

Characters: [b'H' b'e' b'l' b'l' b'o' b' ' b'\xe4\xb8\x96' b'\xe7\x95\x8c' b' '
 b'\xf0\x9f\x8c\x8d']

Character analysis:
Position 0: 'H' (codepoint: 72, byte offset: 0)
Position 1: 'e' (codepoint: 101, byte offset: 1)
Position 2: 'l' (codepoint: 108, byte offset: 2)
Position 3: 'l' (codepoint: 108, byte offset: 3)
Position 4: 'o' (codepoint: 111, byte offset: 4)
Position 5: ' ' (codepoint: 32, byte offset: 5)
Position 6: '?' (codepoint: 19990, byte offset: 6)
Position 7: '?' (codepoint: 30028, byte offset: 9)
Position 8: ' ' (codepoint: 32, byte offset: 12)
Position 9: '?' (codepoint: 127757, byte offset: 13)

Conclusion

TensorFlow's Unicode string methods enable efficient processing of international text data. Use tf.strings.unicode_split for character tokenization and tf.strings.unicode_decode_with_offsets when you need precise byte positioning for text alignment tasks.

AmitDiwan

Updated on: 2026-03-25T16:06:48+05:30

861 Views

Previous Next