How can Tensorflow be used with tf.data for finer control using Python?
The tf.data API in TensorFlow provides finer control over data preprocessing pipelines. It helps create efficient input pipelines by shuffling datasets, splitting data, and optimizing data loading for training neural networks.
We will demonstrate using the flowers dataset, which contains thousands of flower images organized in 5 subdirectories (one per class). This example shows how to create a customized input pipeline with proper train-validation splitting.
We are using Google Colaboratory to run the code. Google Colab provides free access to GPUs and requires zero configuration for running TensorFlow code.
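The flowers dataset itself can be fetched with tf.keras.utils.get_file, as in the TensorFlow tutorial. If you just want to try the pipeline offline, a hedged sketch like the following mimics the same directory layout (one subdirectory per class) with placeholder files — the file names and empty contents here are stand-ins, not real images:

```python
import tempfile
from pathlib import Path

# Build a tiny synthetic tree that mirrors flower_photos/<class>/<image>.jpg
# so data_dir and image_count can be defined without downloading anything.
data_dir = Path(tempfile.mkdtemp()) / "flower_photos"
for cls in ["daisy", "dandelion", "roses", "sunflowers", "tulips"]:
    (data_dir / cls).mkdir(parents=True)
    for i in range(4):  # a few dummy files per class
        (data_dir / cls / f"img_{i}.jpg").write_bytes(b"")

image_count = len(list(data_dir.glob("*/*.jpg")))
print(image_count)  # 5 classes * 4 files = 20
```

With data_dir and image_count defined this way, the pipeline code below runs unchanged (though the dummy files cannot be decoded as real JPEGs).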
Creating a Custom Input Pipeline
First, we create a dataset from file paths and shuffle them for better training distribution −
import tensorflow as tf
import numpy as np
from pathlib import Path
# Assume data_dir and image_count are already defined
# data_dir = Path('/root/.keras/datasets/flower_photos')
# image_count = len(list(data_dir.glob('*/*.jpg')))
print("Defining customized input pipeline")
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)
for f in list_ds.take(5):
   print(f.numpy())
class_names = np.array(sorted([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"]))
print(class_names)
print("The dataset is split into training and validation set")
val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
print("Length of each subset is displayed below")
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())
Code credit: https://www.tensorflow.org/tutorials/load_data/images
Output
Defining customized input pipeline
b'/root/.keras/datasets/flower_photos/dandelion/14306875733_61d71c64c0_n.jpg'
b'/root/.keras/datasets/flower_photos/dandelion/8935477500_89f22cca03_n.jpg'
b'/root/.keras/datasets/flower_photos/sunflowers/3001531316_efae24d37d_n.jpg'
b'/root/.keras/datasets/flower_photos/daisy/7133935763_82b17c8e1b_n.jpg'
b'/root/.keras/datasets/flower_photos/tulips/17844723633_da85357fe3.jpg'
['daisy' 'dandelion' 'roses' 'sunflowers' 'tulips']
The dataset is split into training and validation set
Length of each subset is displayed below
2936
734
How It Works
The pipeline works in several steps −
- File Listing: tf.data.Dataset.list_files() creates a dataset of file paths from the directory structure
- Shuffling: shuffle() randomizes the file order for better training distribution
- Class Extraction: directory names become the class labels automatically
- Train-Val Split: skip() and take() split the data into training (80%) and validation (20%) sets
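The class-extraction step can be made explicit by mapping each file path to an integer label. The get_label helper below is a sketch following the same tutorial's approach — the path and class list are the ones from this example, but the helper name is ours:

```python
import os
import numpy as np
import tensorflow as tf

class_names = np.array(['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'])

def get_label(file_path):
    # Split the path into components; the second-to-last part
    # is the class subdirectory name.
    parts = tf.strings.split(file_path, os.path.sep)
    one_hot = parts[-2] == class_names  # boolean match against class names
    return tf.argmax(one_hot)           # index of the matching class

label = get_label(tf.constant('/root/.keras/datasets/flower_photos/roses/x.jpg'))
print(int(label))  # 2, the index of 'roses' in the sorted class list
```

Mapping this function over list_ds with dataset.map(get_label) yields a dataset of labels aligned with the file paths.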
Key Benefits
Using tf.data provides several advantages over keras.preprocessing −
- Fine Control: Custom preprocessing, augmentation, and batching strategies
- Performance: Optimized data loading with prefetching and parallel processing
- Flexibility: Easy integration with complex data pipelines
- Memory Efficiency: Lazy loading prevents memory overflow with large datasets
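These benefits correspond to concrete tf.data calls. A minimal sketch on a toy range dataset — the map function, buffer sizes, and batch size here are placeholders, not tuned values:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data pick buffer sizes dynamically

ds = tf.data.Dataset.range(10)
ds = ds.map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
ds = ds.cache()             # keep results in memory after the first pass
ds = ds.shuffle(10)         # randomize element order each epoch
ds = ds.batch(4)            # group elements into batches
ds = ds.prefetch(AUTOTUNE)  # overlap data loading with training

for batch in ds:
    print(batch.numpy())
```

The same cache/shuffle/batch/prefetch chain is what the TensorFlow tutorial applies to the flowers pipeline before training.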
Conclusion
tf.data provides powerful tools for building efficient input pipelines with fine-grained control. It enables custom data preprocessing, optimal performance through parallel loading, and seamless integration with TensorFlow training workflows.
