How can TensorFlow be used to work with character substrings in Python?
TensorFlow provides powerful string manipulation capabilities through the tf.strings module. The tf.strings.substr function allows you to extract character substrings from TensorFlow string tensors, with support for both byte-level and Unicode character-level operations.
Basic Substring Extraction
Let's start with a simple example of extracting a substring from a TensorFlow string tensor:
import tensorflow as tf
# Create a string tensor
text = tf.constant("Hello TensorFlow")
# Extract substring: position 6, length 10
substring = tf.strings.substr(text, pos=6, len=10)
print("Original text:", text.numpy().decode('utf-8'))
print("Substring:", substring.numpy().decode('utf-8'))
Original text: Hello TensorFlow
Substring: TensorFlow
Working with Unicode Characters
TensorFlow handles Unicode strings efficiently. Here's how to work with Unicode characters using the unit parameter:
import tensorflow as tf
# Unicode string with special characters
thanks = tf.constant("Thanks 😊")
print("The default unit is byte")
print("When len is 1, a single byte is returned")
byte_result = tf.strings.substr(thanks, pos=7, len=1)
print("Byte result:", byte_result.numpy())
print("\nThe unit is specified as UTF8_CHAR")
print("It takes up 4 bytes for emoji")
char_result = tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR')
print("UTF8_CHAR result:", char_result.numpy().decode('utf-8'))
The default unit is byte
When len is 1, a single byte is returned
Byte result: b'\xf0'

The unit is specified as UTF8_CHAR
It takes up 4 bytes for emoji
UTF8_CHAR result: 😊
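To make the byte versus character distinction concrete, here is a minimal sketch that measures the same kind of string with tf.strings.length in both units. It assumes the sample text ends with the emoji U+1F60A, which UTF-8 encodes in 4 bytes; the variable names are illustrative only.

```python
import tensorflow as tf

# Assumption: the sample text ends with the emoji U+1F60A (4 bytes in UTF-8).
thanks = tf.constant("Thanks \U0001F60A")

# The default unit is BYTE, so this counts encoded bytes.
byte_length = tf.strings.length(thanks)
# With UTF8_CHAR, the same call counts Unicode characters instead.
char_length = tf.strings.length(thanks, unit="UTF8_CHAR")

print("Bytes:", byte_length.numpy())
print("Characters:", char_length.numpy())

# Taking all 4 bytes starting at byte offset 7 recovers the complete emoji.
emoji_bytes = tf.strings.substr(thanks, pos=7, len=4)
print("Emoji bytes:", emoji_bytes.numpy())
```

The two lengths differ because the seven ASCII characters take one byte each while the emoji takes four, which is why a one-byte substring at offset 7 returns only the first byte of the emoji.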
Multiple String Processing
You can process multiple strings simultaneously using TensorFlow tensors:
import tensorflow as tf
# Multiple strings
strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])
# Extract the first 5 bytes of each string (5 characters here, since the text is ASCII)
substrings = tf.strings.substr(strings, pos=0, len=5)
print("Original strings:")
for i, s in enumerate(strings.numpy()):
    print(f" {i}: {s.decode('utf-8')}")
print("\nSubstrings (first 5 chars):")
for i, s in enumerate(substrings.numpy()):
    print(f" {i}: {s.decode('utf-8')}")
Original strings:
 0: Hello World
 1: Python Programming
 2: TensorFlow Rocks

Substrings (first 5 chars):
 0: Hello
 1: Pytho
 2: Tenso
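The pos and len arguments do not have to be scalars. As a rough sketch, the example below passes one start offset and one length per string so that a different slice is taken from each element. The offsets and the names starts, lengths, and second_words are chosen for this illustration only.

```python
import tensorflow as tf

strings = tf.constant(["Hello World", "Python Programming", "TensorFlow Rocks"])

# One start offset and one length per string (illustrative values that
# point at the second word of each string).
starts = tf.constant([6, 7, 11])
lengths = tf.constant([5, 11, 5])

# pos and len have the same shape as the input, so each string gets its own slice.
second_words = tf.strings.substr(strings, pos=starts, len=lengths)

decoded = [s.decode("utf-8") for s in second_words.numpy()]
print(decoded)
```

Because these strings are plain ASCII, the default byte offsets line up with character positions; text containing multi-byte characters would need unit='UTF8_CHAR' for the same offsets to mean characters.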
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| input | String tensor to extract from | Required |
| pos | Starting position | Required |
| len | Length of substring | Required |
| unit | 'BYTE' or 'UTF8_CHAR' | 'BYTE' |
How It Works
- The tf.strings.substr operation takes the unit parameter to determine how offsets are interpreted
- When unit='BYTE', positions and lengths are counted in bytes
- When unit='UTF8_CHAR', positions and lengths are counted in Unicode characters
- This is crucial for handling multi-byte Unicode characters correctly
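The two interpretations listed above can be mimicked in plain Python, which may help when reasoning about offsets without running TensorFlow. This is only a sketch: substr_sketch is a hypothetical helper, not part of TensorFlow, and it assumes non-negative pos and length values.

```python
def substr_sketch(text: str, pos: int, length: int, unit: str = "BYTE") -> bytes:
    """Hypothetical helper that imitates the two offset interpretations.

    Assumes pos and length are non-negative.
    """
    if unit == "BYTE":
        # Slice the UTF-8 encoded bytes: offsets count bytes.
        return text.encode("utf-8")[pos:pos + length]
    if unit == "UTF8_CHAR":
        # Slice the Python string: offsets count Unicode code points.
        return text[pos:pos + length].encode("utf-8")
    raise ValueError("unit must be 'BYTE' or 'UTF8_CHAR'")


# Assumption: the sample text ends with the 4-byte emoji U+1F60A.
sample = "Thanks \U0001F60A"

byte_piece = substr_sketch(sample, 7, 1)               # first byte of the emoji only
char_piece = substr_sketch(sample, 7, 1, "UTF8_CHAR")  # the whole emoji

print(byte_piece)
print(char_piece.decode("utf-8"))
```

The byte-level slice cuts the emoji apart and leaves a fragment that is not valid UTF-8 on its own, while the character-level slice returns all four bytes of the emoji.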
Conclusion
TensorFlow's tf.strings.substr provides efficient substring extraction with proper Unicode support. Use unit='UTF8_CHAR' for character-level operations and unit='BYTE' for byte-level processing.
