How can TensorFlow be used to work with character substrings in Python?

TensorFlow provides powerful string manipulation capabilities through the tf.strings module. The tf.strings.substr function allows you to extract character substrings from TensorFlow string tensors, with support for both byte-level and Unicode character-level operations.

Basic Substring Extraction

Let's start with a simple example of extracting a substring from a TensorFlow string tensor:

import tensorflow as tf

# Create a string tensor
text = tf.constant("Hello TensorFlow")

# Extract substring: position 6, length 10
substring = tf.strings.substr(text, pos=6, len=10)
print("Original text:", text.numpy().decode('utf-8'))
print("Substring:", substring.numpy().decode('utf-8'))

Output:

Original text: Hello TensorFlow
Substring: TensorFlow
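
The same call also accepts a negative pos, which counts backwards from the end of the string, and a len that runs past the end of the string, which is clipped to what is available. The following is a minimal sketch of both behaviours; the variable names are illustrative only.

```python
import tensorflow as tf

text = tf.constant("Hello TensorFlow")

# A negative pos counts backwards from the end of the string.
suffix = tf.strings.substr(text, pos=-10, len=10)

# A len that would run past the end is clipped to the remaining bytes.
clipped = tf.strings.substr(text, pos=6, len=100)

suffix_str = suffix.numpy().decode("utf-8")
clipped_str = clipped.numpy().decode("utf-8")
print("Suffix:", suffix_str)
print("Clipped:", clipped_str)
```

Both calls should return the same final word of the input string.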

Working with Unicode Characters

TensorFlow string tensors hold raw bytes, so tf.strings.substr counts in bytes by default. The unit parameter switches it to counting UTF-8 characters instead:

import tensorflow as tf

# Unicode string with special characters
thanks = tf.constant("Thanks 😊")

print("The default unit is byte")
print("When len is 1, a single byte is returned")
byte_result = tf.strings.substr(thanks, pos=7, len=1)
print("Byte result:", byte_result.numpy())

print("\nThe unit is specified as UTF8_CHAR")
print("The emoji takes up 4 bytes")
char_result = tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR')
print("UTF8_CHAR result:", char_result.numpy().decode('utf-8'))

Output:

The default unit is byte
When len is 1, a single byte is returned
Byte result: b'\xf0'

The unit is specified as UTF8_CHAR
The emoji takes up 4 bytes
UTF8_CHAR result: 😊
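
To see why the two units give different results, it can help to measure the string both ways with tf.strings.length, which takes the same unit parameter. This is a small sketch, assuming a string that ends in one 4-byte emoji (written here as an escape sequence so the code does not depend on file encoding).

```python
import tensorflow as tf

# "\U0001F60A" is a smiling-face emoji; any 4-byte character would behave the same.
thanks = tf.constant("Thanks \U0001F60A")

# Length in bytes (the default unit) versus length in UTF-8 characters.
byte_len = int(tf.strings.length(thanks).numpy())
char_len = int(tf.strings.length(thanks, unit="UTF8_CHAR").numpy())
print("Bytes:", byte_len)
print("Characters:", char_len)

# Four bytes starting at byte 7 and one character starting at character 7
# select the same emoji.
by_bytes = tf.strings.substr(thanks, pos=7, len=4).numpy()
by_chars = tf.strings.substr(thanks, pos=7, len=1, unit="UTF8_CHAR").numpy()
print(by_bytes == by_chars)
```

The byte length exceeds the character length by three because the emoji occupies four bytes but counts as a single character.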

Multiple String Processing

You can process multiple strings in a single call by passing a tensor of strings:

import tensorflow as tf

# Multiple strings
strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])

# Extract the first 5 bytes of each string (5 characters here, since the text is ASCII)
substrings = tf.strings.substr(strings, pos=0, len=5)

print("Original strings:")
for i, s in enumerate(strings.numpy()):
    print(f"  {i}: {s.decode('utf-8')}")

print("\nSubstrings (first 5 chars):")
for i, s in enumerate(substrings.numpy()):
    print(f"  {i}: {s.decode('utf-8')}")

Output:

Original strings:
  0: Hello World
  1: Python Programming
  2: TensorFlow Rocks

Substrings (first 5 chars):
  0: Hello
  1: Pytho
  2: Tenso
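
pos and len do not have to be single numbers. They can also be tensors with the same shape as the input, giving each string its own start and length. The sketch below pulls the second word out of each string; the offsets are hand-computed for these three inputs and are an assumption of the example, not something TensorFlow works out for you.

```python
import tensorflow as tf

strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])

# One start position and one length per string (same shape as the input).
starts = tf.constant([6, 7, 11])
lengths = tf.constant([5, 11, 5])

words = tf.strings.substr(strings, pos=starts, len=lengths)
second_words = [w.decode("utf-8") for w in words.numpy()]
print(second_words)
```

For inputs where the word boundaries are not known in advance, a splitting function such as tf.strings.split is usually a better fit than fixed offsets.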

Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| input | String tensor to extract from | Required |
| pos | Starting position | Required |
| len | Length of substring | Required |
| unit | 'BYTE' or 'UTF8_CHAR' | 'BYTE' |

How It Works

  • The unit parameter tells tf.strings.substr how to interpret pos and len
  • When unit='BYTE', positions and lengths are counted in bytes
  • When unit='UTF8_CHAR', positions and lengths are counted in UTF-8 encoded Unicode characters
  • The distinction matters for multi-byte characters, because a byte-based slice can cut a character in half
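
The points above can be illustrated with a short sketch. It assumes a string containing "é", which UTF-8 encodes as two bytes, and takes the first four units of it in each mode.

```python
import tensorflow as tf

# "\u00e9" is "é", which UTF-8 encodes as the two bytes 0xC3 0xA9.
word = tf.constant("Caf\u00e9 au lait")

# Four bytes: the slice ends halfway through "é" and is not valid UTF-8.
first4_bytes = tf.strings.substr(word, pos=0, len=4).numpy()

# Four characters: the slice ends after the complete "é".
first4_chars = tf.strings.substr(word, pos=0, len=4, unit="UTF8_CHAR").numpy()

print(first4_bytes)
print(first4_chars.decode("utf-8"))
```

Calling decode('utf-8') on the byte-based result would fail, which is why it is printed as raw bytes here.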

Conclusion

TensorFlow's tf.strings.substr provides efficient substring extraction with proper Unicode support. Use unit='UTF8_CHAR' for character-level operations and unit='BYTE' for byte-level processing.

Updated on: 2026-03-25T16:06:25+05:30
