How can TensorFlow be used to convert between different string representations?

TensorFlow's tf.strings module provides functions for converting between Unicode string representations. Three of them cover the common cases: unicode_decode converts an encoded string to a vector of code points, unicode_encode converts code points back to an encoded string, and unicode_transcode converts a string from one encoding to another.

Setting Up the Data

First, let's create some sample Unicode text to work with:

import tensorflow as tf

# Sample Unicode text
text_utf8 = tf.constant("语言处理")
print("Original UTF-8 text:", text_utf8)

# Convert to code points for demonstration
text_chars = tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
print("Code points:", text_chars)

Output:

Original UTF-8 text: tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Code points: tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)

Converting Encoded String to Code Points

The unicode_decode function converts an encoded string scalar to a vector of Unicode code points:

import tensorflow as tf

text_utf8 = tf.constant("Hello 世界")
print("Converting encoded string scalar to a vector of code points")

# Decode UTF-8 string to Unicode code points
code_points = tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
print("Code points:", code_points)
print("Shape:", code_points.shape)

Output:

Converting encoded string scalar to a vector of code points
Code points: tf.Tensor([ 72 101 108 108 111  32 19990 30028], shape=(8,), dtype=int32)
Shape: (8,)
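
The example above decodes a single scalar string. As a minimal sketch of the batched case, assuming eager execution in TensorFlow 2.x, the snippet below passes a vector of strings instead. Because each string can hold a different number of characters, unicode_decode returns a tf.RaggedTensor rather than a dense tensor. The variable names batch, decoded, and rows are illustrative only.

```python
import tensorflow as tf

# A batch of UTF-8 strings with different character counts.
batch = tf.constant(["Hello", "世界"])

# For non-scalar input, unicode_decode returns a tf.RaggedTensor,
# since each row may contain a different number of code points.
decoded = tf.strings.unicode_decode(batch, input_encoding="UTF-8")

# Convert to nested Python lists for easy inspection.
rows = decoded.to_list()
print(rows)  # [[72, 101, 108, 108, 111], [19990, 30028]]
```

If a dense tensor is needed downstream, decoded.to_tensor() pads the shorter rows with a default value.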

Converting Code Points to Encoded String

The unicode_encode function converts a vector of Unicode code points back to an encoded string scalar:

import tensorflow as tf

# Sample code points
code_points = tf.constant([72, 101, 108, 108, 111, 32, 19990, 30028])
print("Converting vector of code points to an encoded string scalar")

# Encode code points to UTF-8 string
encoded_string = tf.strings.unicode_encode(code_points, output_encoding='UTF-8')
print("Encoded string:", encoded_string)

Output:

Converting vector of code points to an encoded string scalar
Encoded string: tf.Tensor(b'Hello \xe4\xb8\x96\xe7\x95\x8c', shape=(), dtype=string)
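
unicode_encode also accepts a batch of code point sequences. The following sketch, assuming eager execution in TensorFlow 2.x, encodes a ragged batch in which each row becomes one output string. The sample values and the names code_point_batch, encoded_batch, and texts are made up for illustration.

```python
import tensorflow as tf

# Two code point sequences of different lengths, stored as a ragged tensor.
code_point_batch = tf.ragged.constant([[72, 105], [19990, 30028]], dtype=tf.int32)

# Each row is encoded into one UTF-8 string, giving a string tensor of shape (2,).
encoded_batch = tf.strings.unicode_encode(code_point_batch, output_encoding="UTF-8")

# Decode the raw bytes into Python str objects for display.
texts = [s.decode("utf-8") for s in encoded_batch.numpy()]
print(texts)  # ['Hi', '世界']
```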

Converting Between Different Encodings

The unicode_transcode function converts an encoded string from one encoding to another:

import tensorflow as tf

text_utf8 = tf.constant("语言处理")
print("Converting encoded string scalar to a different encoding")

# Transcode from UTF-8 to UTF-16-BE
transcoded = tf.strings.unicode_transcode(text_utf8, input_encoding='UTF-8', output_encoding='UTF-16-BE')
print("UTF-16-BE encoded:", transcoded)

# Transcode back to UTF-8
back_to_utf8 = tf.strings.unicode_transcode(transcoded, input_encoding='UTF-16-BE', output_encoding='UTF-8')
print("Back to UTF-8:", back_to_utf8)

Output:

Converting encoded string scalar to a different encoding
UTF-16-BE encoded: tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Back to UTF-8: tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
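
One way to sanity-check a transcoding result is to compare it with Python's built-in codecs. The sketch below, assuming eager execution in TensorFlow 2.x, transcodes the same text to UTF-16-BE with TensorFlow and with str.encode, then compares the raw bytes. The names tf_bytes and py_bytes are illustrative.

```python
import tensorflow as tf

text = "语言处理"

# Transcode with TensorFlow (a Python str is stored as UTF-8 by tf.constant).
transcoded = tf.strings.unicode_transcode(
    tf.constant(text), input_encoding="UTF-8", output_encoding="UTF-16-BE"
)
tf_bytes = transcoded.numpy()

# Encode the same text with Python's built-in UTF-16 big-endian codec.
py_bytes = text.encode("utf-16-be")

print(tf_bytes == py_bytes)  # True
```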

Complete Example

The following example uses all three conversion functions on one string:

import tensorflow as tf

# Original text
original_text = tf.constant("Python 🐍")
print("Original text:", original_text)

# Step 1: Decode to code points
code_points = tf.strings.unicode_decode(original_text, input_encoding='UTF-8')
print("Code points:", code_points)

# Step 2: Encode back to UTF-8
reconstructed = tf.strings.unicode_encode(code_points, output_encoding='UTF-8')
print("Reconstructed:", reconstructed)

# Step 3: Transcode to UTF-16-BE
utf16_encoded = tf.strings.unicode_transcode(original_text, input_encoding='UTF-8', output_encoding='UTF-16-BE')
print("UTF-16-BE:", utf16_encoded)

Output:

Original text: tf.Tensor(b'Python \xf0\x9f\x90\x8d', shape=(), dtype=string)
Code points: tf.Tensor([ 80 121 116 104 111 110  32 128013], shape=(8,), dtype=int32)
Reconstructed: tf.Tensor(b'Python \xf0\x9f\x90\x8d', shape=(), dtype=string)
UTF-16-BE: tf.Tensor(b'\x00P\x00y\x00t\x00h\x00o\x00n\x00 \xd8=\xdc\r', shape=(), dtype=string)
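
The output above shows that one code point does not always correspond to one byte. The emoji is a single code point (128013) but takes four bytes in UTF-8 and a surrogate pair, also four bytes, in UTF-16-BE. As a small sketch, assuming eager execution in TensorFlow 2.x, tf.strings.length can measure the same string in bytes or in UTF-8 characters. The variable names are illustrative.

```python
import tensorflow as tf

text = tf.constant("Python 🐍")

# Length in raw bytes of the UTF-8 encoding: 7 ASCII bytes + 4 bytes for the emoji.
num_bytes = int(tf.strings.length(text, unit="BYTE").numpy())

# Length in Unicode characters (code points).
num_chars = int(tf.strings.length(text, unit="UTF8_CHAR").numpy())

# After transcoding to UTF-16-BE, each ASCII character takes 2 bytes
# and the emoji takes 4 bytes (a surrogate pair).
utf16 = tf.strings.unicode_transcode(
    text, input_encoding="UTF-8", output_encoding="UTF-16-BE"
)
num_utf16_bytes = int(tf.strings.length(utf16, unit="BYTE").numpy())

print(num_bytes, num_chars, num_utf16_bytes)  # 11 8 18
```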

Key Functions Summary

| Function | Purpose | Input | Output |
|---|---|---|---|
| unicode_decode | String to code points | Encoded string | Vector of integers |
| unicode_encode | Code points to string | Vector of integers | Encoded string |
| unicode_transcode | Change encoding | Encoded string | Differently encoded string |

Conclusion

TensorFlow's tf.strings functions cover the common Unicode conversions needed in text processing. Use unicode_decode and unicode_encode when you need to work with individual code points, and unicode_transcode when you only need to change a string's encoding.

Updated on: 2026-03-25T16:05:30+05:30
