How can TensorFlow be used to work with character substrings in Python?

TensorFlow provides string manipulation operations through the tf.strings module. The tf.strings.substr function extracts substrings from TensorFlow string tensors, with positions and lengths counted either in bytes or in UTF-8 characters.

Basic Substring Extraction

Let's start with a simple example of extracting a substring from a TensorFlow string tensor:

import tensorflow as tf

# Create a string tensor
text = tf.constant("Hello TensorFlow")

# Extract substring: position 6, length 10
substring = tf.strings.substr(text, pos=6, len=10)
print("Original text:", text.numpy().decode('utf-8'))
print("Substring:", substring.numpy().decode('utf-8'))

Original text: Hello TensorFlow
Substring: TensorFlow
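
Two further behaviours of tf.strings.substr are worth knowing here: a negative pos counts backwards from the end of the string, and a len that runs past the end of the string is truncated to whatever is available. The following is a minimal sketch of both, reusing the same sample text; the variable names from_end and clipped are only illustrative.

```python
import tensorflow as tf

text = tf.constant("Hello TensorFlow")

# A negative pos is measured backwards from the end of the string.
from_end = tf.strings.substr(text, pos=-10, len=10)

# A len that extends past the end is truncated to the remaining bytes.
clipped = tf.strings.substr(text, pos=6, len=100)

print("From end:", from_end.numpy().decode("utf-8"))
print("Clipped:", clipped.numpy().decode("utf-8"))
```

Both calls should yield the same "TensorFlow" substring as the example above.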

Working with Unicode Characters

TensorFlow string tensors hold UTF-8 encoded bytes, so a single character may occupy several bytes. Here's how the unit parameter controls whether offsets are counted in bytes or in characters:

import tensorflow as tf

# Unicode string with special characters
thanks = tf.constant("Thanks 😊")

print("The default unit is byte")
print("When len is 1, a single byte is returned")
byte_result = tf.strings.substr(thanks, pos=7, len=1)
print("Byte result:", byte_result.numpy())

print("\nThe unit is specified as UTF8_CHAR")
print("It takes up 4 bytes for emoji")
char_result = tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR')
print("UTF8_CHAR result:", char_result.numpy().decode('utf-8'))

The default unit is byte
When len is 1, a single byte is returned
Byte result: b'\xf0'

The unit is specified as UTF8_CHAR
It takes up 4 bytes for emoji
UTF8_CHAR result: 😊
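
To see why the two units disagree, it can help to measure the same string both ways. The sketch below assumes the sample string ends in a four-byte emoji such as 😊 and uses tf.strings.length, which accepts the same unit values as tf.strings.substr; the variable names are illustrative.

```python
import tensorflow as tf

# Assumed sample: 7 ASCII characters followed by one 4-byte emoji.
thanks = tf.constant("Thanks 😊")

# Length in bytes (the default unit) versus length in UTF-8 characters.
byte_count = tf.strings.length(thanks)
char_count = tf.strings.length(thanks, unit="UTF8_CHAR")

# Taking all 4 bytes of the emoji in BYTE mode gives a decodable result.
emoji_bytes = tf.strings.substr(thanks, pos=7, len=4)

print("Bytes:", byte_count.numpy())
print("Characters:", char_count.numpy())
print("Emoji from 4 bytes:", emoji_bytes.numpy().decode("utf-8"))
```

Under that assumption the string is 11 bytes long but only 8 characters long, which is why len=1 returns a lone byte in one mode and the whole emoji in the other.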

Multiple String Processing

You can process multiple strings in a single call by passing a tensor of strings:

import tensorflow as tf

# Multiple strings
strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])

# Extract the first 5 bytes from each string (5 characters here, since the text is ASCII)
substrings = tf.strings.substr(strings, pos=0, len=5)

print("Original strings:")
for i, s in enumerate(strings.numpy()):
    print(f"  {i}: {s.decode('utf-8')}")

print("\nSubstrings (first 5 chars):")
for i, s in enumerate(substrings.numpy()):
    print(f"  {i}: {s.decode('utf-8')}")

Original strings:
  0: Hello World
  1: Python Programming
  2: TensorFlow Rocks

Substrings (first 5 chars):
  0: Hello
  1: Pytho
  2: Tenso
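
The example above applies one scalar pos and len to every string. tf.strings.substr also accepts pos and len tensors with the same shape as the input, in which case each string is sliced with its own offsets. A short sketch, with offsets picked by hand for these three sample strings:

```python
import tensorflow as tf

strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])

# One (pos, len) pair per string; both tensors match the input's shape.
starts = tf.constant([6, 7, 11])
lengths = tf.constant([5, 11, 5])

# Each element is sliced with the pos/len at the same index.
words = tf.strings.substr(strings, pos=starts, len=lengths)

for i, s in enumerate(words.numpy()):
    print(f"  {i}: {s.decode('utf-8')}")
```

This extracts the second word of each string: World, Programming, and Rocks.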

Key Parameters

Parameter	Description	Default
`input`	String tensor to extract from	Required
`pos`	Starting position, counted in `unit`; negative values count from the end	Required
`len`	Length of the substring, counted in `unit`	Required
`unit`	'BYTE' or 'UTF8_CHAR'	'BYTE'

How It Works

The unit parameter of tf.strings.substr determines how pos and len are interpreted.
When unit='BYTE', positions and lengths are counted in bytes.
When unit='UTF8_CHAR', positions and lengths are counted in UTF-8 encoded Unicode characters.
The distinction matters for multi-byte characters: a byte-based slice can cut a character in half.
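
The difference between the two units can be illustrated without TensorFlow at all. The sketch below is only an analogy in plain Python, not a description of TensorFlow's internals: slicing a bytes object behaves like unit='BYTE', and slicing a str behaves like unit='UTF8_CHAR'. The sample string is an assumption chosen to end in a four-byte emoji.

```python
# Plain-Python analogy: bytes slicing ~ unit='BYTE', str slicing ~ unit='UTF8_CHAR'.
text = "Thanks 😊"
encoded = text.encode("utf-8")

# Two bytes starting at offset 7 cover only half of the 4-byte emoji.
partial = encoded[7:9]
try:
    partial.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# One character starting at offset 7 is the complete emoji.
whole = text[7:8]

print("Partial bytes:", partial)
print("Partial bytes decode cleanly:", decoded_ok)
print("Whole character:", whole)
```

The byte slice cannot be decoded as UTF-8 because it stops mid-character, while the character slice returns the emoji intact.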

Conclusion

TensorFlow's tf.strings.substr provides efficient substring extraction with proper Unicode support. Use unit='UTF8_CHAR' for character-level operations and unit='BYTE' for byte-level processing.

AmitDiwan

Updated on: 2026-03-25T16:06:25+05:30
