Load Text in TensorFlow
TensorFlow is a powerful open-source framework developed by Google that excels at handling various types of data, including text. Loading and processing text data efficiently is crucial for natural language processing tasks like sentiment analysis, text classification, and language translation.
Understanding Text Data in TensorFlow
Text data is unstructured and requires special handling before it can be used in machine learning models. TensorFlow provides the tf.data API with specialized classes like TextLineDataset to streamline text data loading and preprocessing operations.
Installing TensorFlow
Before working with text data, ensure TensorFlow is installed:
pip install tensorflow
Loading Single Text Files
The TextLineDataset class reads text files line by line, treating each line as a separate data sample:
import tensorflow as tf

# Create sample text data
sample_text = """Hello world
This is line two
Machine learning is awesome
TensorFlow makes it easy
Text processing with Python"""

# Write to a file for demonstration
with open('sample.txt', 'w') as f:
    f.write(sample_text)

# Load the text file
dataset = tf.data.TextLineDataset("sample.txt")

# Display first 3 lines
for line in dataset.take(3):
    print(line.numpy().decode('utf-8'))
Output:
Hello world
This is line two
Machine learning is awesome
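TextLineDataset also composes with other tf.data transformations. As a sketch (the filename and contents here are illustrative, not from the article), skip() can drop a header row and filter() can discard blank lines:

```python
import tensorflow as tf

# Hypothetical file with a header row and a blank line
with open('with_header.txt', 'w') as f:
    f.write("id,text\n\nfirst record\nsecond record")

dataset = tf.data.TextLineDataset("with_header.txt")
dataset = dataset.skip(1)  # drop the header line
dataset = dataset.filter(lambda line: tf.strings.length(line) > 0)  # drop blank lines

lines = [line.numpy().decode('utf-8') for line in dataset]
print(lines)  # ['first record', 'second record']
```

This pattern is handy for CSV-like text files where the first line is metadata rather than data.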
Loading Multiple Text Files
You can load multiple text files simultaneously by passing a list of filenames:
import tensorflow as tf
# Create multiple sample files
texts = [
"First file content\nLine 2 of file 1",
"Second file content\nLine 2 of file 2",
"Third file content\nLine 2 of file 3"
]
filenames = []
for i, text in enumerate(texts):
    filename = f'file_{i+1}.txt'
    with open(filename, 'w') as f:
        f.write(text)
    filenames.append(filename)
# Load multiple files
dataset = tf.data.TextLineDataset(filenames)
print("Lines from all files:")
for line in dataset.take(6):
    print(line.numpy().decode('utf-8'))
Output:
Lines from all files:
First file content
Line 2 of file 1
Second file content
Line 2 of file 2
Third file content
Line 2 of file 3
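When filenames follow a pattern, you don't have to list them by hand: tf.data.Dataset.list_files can discover them with a glob, and interleave() can mix lines from several files. A minimal sketch, recreating the three sample files so it runs on its own:

```python
import tensorflow as tf

# Recreate the three sample files so the snippet is self-contained
for i, text in enumerate([
    "First file content\nLine 2 of file 1",
    "Second file content\nLine 2 of file 2",
    "Third file content\nLine 2 of file 3",
]):
    with open(f'file_{i+1}.txt', 'w') as f:
        f.write(text)

# Discover files by glob pattern (shuffle=False keeps a deterministic order)
files = tf.data.Dataset.list_files("file_*.txt", shuffle=False)

# Read one line from each file in turn instead of file-by-file
dataset = files.interleave(
    tf.data.TextLineDataset,
    cycle_length=3,   # read from all three files concurrently
    block_length=1,   # take one line from each before moving on
)

lines = [line.numpy().decode('utf-8') for line in dataset]
print(lines)
```

Interleaving is useful when you want consecutive training examples to come from different source files rather than exhausting one file before starting the next.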
Batch Processing for Large Files
For large text files, use batching to process data in manageable chunks:
import tensorflow as tf
# Create a larger sample file
large_text = "\n".join([f"This is line number {i+1}" for i in range(20)])
with open('large_sample.txt', 'w') as f:
    f.write(large_text)
# Load and batch the data
dataset = tf.data.TextLineDataset("large_sample.txt")
batched_dataset = dataset.batch(5)
print("First batch:")
for batch in batched_dataset.take(1):
    for line in batch:
        print(line.numpy().decode('utf-8'))
print()
print("Second batch:")
for batch in batched_dataset.skip(1).take(1):
    for line in batch:
        print(line.numpy().decode('utf-8'))
Output:
First batch:
This is line number 1
This is line number 2
This is line number 3
This is line number 4
This is line number 5

Second batch:
This is line number 6
This is line number 7
This is line number 8
This is line number 9
This is line number 10
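In real training pipelines, batching is usually combined with shuffling and prefetching. A minimal sketch (recreating the large_sample.txt file from above so it runs on its own; buffer size and seed are illustrative choices):

```python
import tensorflow as tf

# Recreate large_sample.txt so the snippet is self-contained
large_text = "\n".join([f"This is line number {i+1}" for i in range(20)])
with open('large_sample.txt', 'w') as f:
    f.write(large_text)

dataset = (
    tf.data.TextLineDataset("large_sample.txt")
    .shuffle(buffer_size=20, seed=42)  # randomize line order for training
    .batch(5)                          # group lines into batches of 5
    .prefetch(tf.data.AUTOTUNE)        # overlap loading with downstream work
)

for batch in dataset.take(1):
    print(batch.shape)  # each batch is a 1-D tensor of 5 strings
```

Prefetching lets the input pipeline prepare the next batch while the model consumes the current one, which matters once files no longer fit comfortably in memory.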
Text Preprocessing Pipeline
Combine loading with preprocessing operations for a complete text processing pipeline:
import tensorflow as tf
# Sample text with mixed case and punctuation
mixed_text = """Hello World!
TENSORFLOW is Great
machine learning rocks"""
with open('mixed_case.txt', 'w') as f:
    f.write(mixed_text)
# Load and preprocess
dataset = tf.data.TextLineDataset("mixed_case.txt")
# Convert to lowercase and split into words
def preprocess_line(line):
    line = tf.strings.lower(line)
    words = tf.strings.split(line)
    return words
dataset = dataset.map(preprocess_line)
print("Preprocessed text:")
for words in dataset:
    print(words.numpy())
Output:
Preprocessed text:
[b'hello' b'world!']
[b'tensorflow' b'is' b'great']
[b'machine' b'learning' b'rocks']
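Going one step further, the loaded lines can be turned into integer token ids with Keras's TextVectorization layer, which is what a model ultimately consumes. A minimal sketch (recreating mixed_case.txt so it runs on its own; the layer's default standardization handles lowercasing and punctuation stripping):

```python
import tensorflow as tf

# Recreate mixed_case.txt so the snippet is self-contained
mixed_text = """Hello World!
TENSORFLOW is Great
machine learning rocks"""
with open('mixed_case.txt', 'w') as f:
    f.write(mixed_text)

dataset = tf.data.TextLineDataset("mixed_case.txt")

# Learn a vocabulary from the dataset, then map lines to token ids
vectorizer = tf.keras.layers.TextVectorization(output_mode='int')
vectorizer.adapt(dataset.batch(2))

# Each batch of lines becomes a padded tensor of integer ids
for ids in dataset.batch(1).map(vectorizer):
    print(ids.numpy())
```

The exact ids depend on the learned vocabulary order, but every known word maps to an integer of 2 or greater (0 is reserved for padding and 1 for out-of-vocabulary tokens).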
Conclusion
TensorFlow's TextLineDataset provides an efficient way to load text data for machine learning pipelines. Whether working with single files, multiple files, or large datasets requiring batch processing, TensorFlow's data loading capabilities streamline text preprocessing and model training workflows.
