How can TensorFlow and Python be used to build a ragged tensor from a list of words?

TensorFlow's RaggedTensor handles sequences of variable length. To build a ragged tensor from a list of words, decode the text into character code points, then use the start offset of each word to group those code points into rows.

This approach is particularly useful when working with Unicode strings where you need to manipulate text data at the character level while maintaining word boundaries.
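
To illustrate why character-level handling matters for Unicode text, here is a minimal sketch that compares the byte length and the character length of a short mixed-script string. The sample string is an assumption chosen only for illustration; `tf.strings.length` and its `unit` argument are part of the public TensorFlow API.

```python
import tensorflow as tf

# Illustrative string (an assumption, not taken from this article's examples)
text = tf.constant("Hi 世界")

# Length in bytes of the UTF-8 encoding
num_bytes = tf.strings.length(text, unit="BYTE")

# Length in Unicode characters
num_chars = tf.strings.length(text, unit="UTF8_CHAR")

print("Bytes:", int(num_bytes))
print("Characters:", int(num_chars))
```

The two counts differ because each of the last two characters takes three bytes in UTF-8, which is why the examples in this article work with code points rather than raw bytes.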

Prerequisites

We'll use Google Colaboratory, which provides free access to GPUs and requires zero configuration. It's built on top of Jupyter Notebook.

Building RaggedTensor from Word Lists

Here's how to convert a sentence to character code points, the first step toward building the ragged tensor:

import tensorflow as tf

# Sample sentence with mixed scripts
sentence = "Hello, there. 世界"

# Split the string into individual UTF-8 characters
sentence_chars = tf.strings.unicode_split(sentence, 'UTF-8')
print("Characters:", sentence_chars)

# Get character code points
sentence_char_codepoint = tf.strings.unicode_decode(sentence, 'UTF-8')
print("Code points:", sentence_char_codepoint)

Output:
Characters: tf.Tensor([b'H' b'e' b'l' b'l' b'o' b',' b' ' b't' b'h' b'e' b'r' b'e' b'.' b' ' b'\xe4\xb8\x96' b'\xe7\x95\x8c'], shape=(16,), dtype=string)
Code points: tf.Tensor([   72   101   108   108   111    44    32   116   104   101   114   101    46    32 19990 30028], shape=(16,), dtype=int32)
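
As a quick sanity check, the code points printed above can be encoded back into a string. The following sketch assumes only the code point values shown in that output and uses `tf.strings.unicode_encode`, the counterpart of `tf.strings.unicode_decode`.

```python
import tensorflow as tf

# Code points as printed in the output above
codepoints = tf.constant(
    [72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46, 32,
     19990, 30028], dtype=tf.int32)

# Encode the code points back into a single UTF-8 string
encoded = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
print(encoded.numpy().decode("utf-8"))

# Decoding the result again should reproduce the same code points
decoded = tf.strings.unicode_decode(encoded, input_encoding="UTF-8")
print(decoded.numpy().tolist() == codepoints.numpy().tolist())
```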

Creating Word-Level RaggedTensor

Now we'll group the character code points by word boundaries:

import tensorflow as tf

# Start position of each word in the code point sequence.
# Punctuation and the space after it form their own groups.
# The end of the sequence (16) is implied; listing it as a start
# would add an empty trailing row.
word_starts = [0, 5, 7, 12, 14]

# Sample code points from previous example
sentence_char_codepoint = tf.constant([72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46, 32, 19990, 30028])

print("Get the code point of every character in every word")
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint,
    row_starts=word_starts)
print(word_char_codepoint)

print("Get the number of characters in each word")
chars_per_word = tf.reduce_sum(tf.ones_like(word_char_codepoint), axis=1)
print("Characters per word:", chars_per_word)

Output:
Get the code point of every character in every word
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46, 32], [19990, 30028]]>
Get the number of characters in each word
Characters per word: tf.Tensor([5 2 5 2 2], shape=(5,), dtype=int32)
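
The start positions do not have to be typed in by hand. One possible way to derive them, sketched below, is to treat every change of Unicode script (Latin, punctuation, Han, and so on) as a boundary. This heuristic is an assumption made for illustration, not necessarily how the offsets in this article were obtained, and it is not a general-purpose word tokenizer. It relies on `tf.strings.unicode_script`, which returns an integer script identifier for each code point.

```python
import tensorflow as tf

# Same code points as in the example above
sentence_char_codepoint = tf.constant(
    [72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46, 32,
     19990, 30028], dtype=tf.int32)

# Script identifier for every character
char_script = tf.strings.unicode_script(sentence_char_codepoint)

# A character starts a new group if it is the first character or if its
# script differs from the script of the previous character
starts_word = tf.concat(
    [[True], tf.not_equal(char_script[1:], char_script[:-1])], axis=0)

# The indices of the True entries are the row starts
word_starts = tf.squeeze(tf.where(starts_word), axis=1)
print("Word starts:", word_starts.numpy().tolist())

words = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint, row_starts=word_starts)

# row_lengths() reports the number of characters in each row directly
print("Characters per word:", words.row_lengths().numpy().tolist())

# Encode each row back into a string to inspect the grouping
print(tf.strings.unicode_encode(words, "UTF-8").numpy())
```

Here `row_lengths()` is used as a shorter alternative to summing a tensor of ones; note that it returns `int64` values.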

Working with Multiple Sentences

You can also create a ragged tensor from multiple sentences of different lengths, with one row of code points per sentence:

import tensorflow as tf

# Multiple sentences with different lengths
sentences = ["Hello world", "TensorFlow rocks", "AI ML DL"]

# Convert each sentence to character code points
all_codepoints = []
sentence_lengths = []

for sentence in sentences:
    codepoints = tf.strings.unicode_decode(sentence, 'UTF-8')
    all_codepoints.append(codepoints)
    sentence_lengths.append(len(codepoints))

# Stack the variable-length tensors into a single ragged tensor
ragged_sentences = tf.ragged.stack(all_codepoints)
print("Ragged tensor for multiple sentences:")
print(ragged_sentences)

print("\nSentence lengths:", sentence_lengths)

Output:
Ragged tensor for multiple sentences:
<tf.RaggedTensor [[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100], [84, 101, 110, 115, 111, 114, 70, 108, 111, 119, 32, 114, 111, 99, 107, 115], [65, 73, 32, 77, 76, 32, 68, 76]]>

Sentence lengths: [11, 16, 8]
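
The explicit loop above is not the only option. As a hedged alternative sketch, `tf.strings.unicode_decode` can be applied to the whole batch at once and returns a ragged tensor directly. Splitting on whitespace first with `tf.strings.split` (a simple assumption about what counts as a word) adds a word dimension, so each sentence becomes a row of words and each word a row of code points.

```python
import tensorflow as tf

sentences = ["Hello world", "TensorFlow rocks", "AI ML DL"]

# Decoding a batch of strings yields one row of code points per sentence
ragged_sentences = tf.strings.unicode_decode(sentences, "UTF-8")
print(ragged_sentences.row_lengths().numpy().tolist())

# Split each sentence on whitespace: one row of words per sentence
words = tf.strings.split(sentences)
print(words.row_lengths().numpy().tolist())

# Decode every word: dimensions are [sentence, word, character]
word_codepoints = tf.strings.unicode_decode(words, "UTF-8")
print(word_codepoints.to_list()[2])
```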

Key Benefits

  • Memory Efficient: Stores only the actual values, so variable-length sequences need no padding
  • Unicode Support: Works with the code points produced by tf.strings.unicode_decode, so multi-language text is handled per character rather than per byte
  • Flexible Operations: Supports many standard tensor operations, such as reductions, on irregular data
  • Word Boundary Preservation: Keeps the characters of each word together in their own row
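
The padding and reduction points can be seen in a small sketch. The values below are arbitrary and chosen only for illustration.

```python
import tensorflow as tf

# Small illustrative ragged tensor (arbitrary values)
rt = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])

# Reductions operate row by row without any padding
row_sums = tf.reduce_sum(rt, axis=1)
print(row_sums.numpy().tolist())

# The ragged tensor stores only the six actual values
print(int(tf.size(rt.flat_values)))

# Converting to a dense tensor pads the short rows with the default value
dense = rt.to_tensor(default_value=0)
print(dense.shape)
print(dense.numpy().tolist())
```

The dense version holds nine entries for six real values; with long and uneven sequences that gap can grow considerably.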

Conclusion

RaggedTensor handles variable-length text efficiently: the character code points are stored as one flat sequence, and the row start offsets record where each word begins. This approach is useful for NLP tasks that process multilingual text.

Updated on: 2026-03-25T16:08:30+05:30
