How can Tensorflow be used with tf.data for finer control using Python?

The tf.data API in TensorFlow provides finer control over data preprocessing pipelines. It helps create efficient input pipelines by shuffling datasets, splitting data, and optimizing data loading for training neural networks.

We will demonstrate using the flowers dataset, which contains thousands of flower images organized in 5 subdirectories (one per class). This example shows how to create a customized input pipeline with proper train-validation splitting.

We are using Google Colaboratory to run the code. Google Colab provides free access to GPUs and requires zero configuration for running TensorFlow code.
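The code below assumes `data_dir` and `image_count` are already defined. As a sketch, the flowers dataset can be fetched with `tf.keras.utils.get_file` (the download URL is the one used in the linked TensorFlow tutorial) −

```python
import pathlib
import tensorflow as tf

# Download and extract the flowers dataset (cached under ~/.keras/datasets);
# get_file returns the path to the extracted directory when untar=True
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
archive = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(archive)

# Count the JPEG files across all five class subdirectories
image_count = len(list(data_dir.glob('*/*.jpg')))
```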

Creating a Custom Input Pipeline

First, we create a dataset from the file paths and shuffle it so that the classes are well mixed during training −

import tensorflow as tf
import numpy as np
from pathlib import Path

# Assume data_dir and image_count are already defined
# data_dir = Path('/root/.keras/datasets/flower_photos')
# image_count = len(list(data_dir.glob('*/*.jpg')))

print("Defining customized input pipeline")
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)

for f in list_ds.take(5):
    print(f.numpy())

class_names = np.array(sorted([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"]))
print(class_names)

print("The dataset is split into training and validation set")
val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
print("Length of each subset is displayed below")
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())

Code credit: https://www.tensorflow.org/tutorials/load_data/images

Output

Defining customized input pipeline
b'/root/.keras/datasets/flower_photos/dandelion/14306875733_61d71c64c0_n.jpg'
b'/root/.keras/datasets/flower_photos/dandelion/8935477500_89f22cca03_n.jpg'
b'/root/.keras/datasets/flower_photos/sunflowers/3001531316_efae24d37d_n.jpg'
b'/root/.keras/datasets/flower_photos/daisy/7133935763_82b17c8e1b_n.jpg'
b'/root/.keras/datasets/flower_photos/tulips/17844723633_da85357fe3.jpg'
['daisy' 'dandelion' 'roses' 'sunflowers' 'tulips']
The dataset is split into training and validation set
Length of each subset is displayed below
2936
734

How It Works

The pipeline works in several steps −

  • File Listing: tf.data.Dataset.list_files() creates a dataset of file paths from the directory structure
  • Shuffling: shuffle() randomizes the file order for better training distribution
  • Class Extraction: Directory names become class labels automatically
  • Train-Val Split: skip() and take() methods split data into training (80%) and validation (20%) sets
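The class-extraction step above can be sketched as follows. The `get_label`, `decode_img`, and `process_path` helpers and the 180×180 target size follow the approach of the linked TensorFlow tutorial and assume the directory layout shown in the output, where each image's parent directory is its class −

```python
import tensorflow as tf

# Class names in sorted order, matching the directory names of the flowers dataset
class_names = tf.constant(['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'])

def get_label(file_path):
    # The second-to-last path component is the class directory name
    parts = tf.strings.split(file_path, '/')
    one_hot = parts[-2] == class_names
    return tf.argmax(one_hot)

def decode_img(img_bytes, img_height=180, img_width=180):
    # Decode JPEG bytes to a uint8 tensor, then resize to a fixed shape
    img = tf.io.decode_jpeg(img_bytes, channels=3)
    return tf.image.resize(img, [img_height, img_width])

def process_path(file_path):
    # Map a file path to an (image, label) pair for training
    label = get_label(file_path)
    img = decode_img(tf.io.read_file(file_path))
    return img, label
```

Mapping `process_path` over `train_ds` and `val_ds` (e.g. `train_ds.map(process_path)`) turns the path datasets into image/label datasets.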

Key Benefits

Using tf.data provides several advantages over keras.preprocessing −

  • Fine Control: Custom preprocessing, augmentation, and batching strategies
  • Performance: Optimized data loading with prefetching and parallel processing
  • Flexibility: Easy integration with complex data pipelines
  • Memory Efficiency: Lazy loading prevents memory overflow with large datasets
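The performance point can be sketched with a small helper that caches, shuffles, batches, and prefetches a dataset (the `configure_for_performance` name and the buffer/batch sizes here are illustrative, not part of the tutorial code above) −

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def configure_for_performance(ds, batch_size=32):
    # Cache elements after the first epoch to avoid re-reading files
    ds = ds.cache()
    # Shuffle with a fixed buffer; larger buffers mix more thoroughly
    ds = ds.shuffle(buffer_size=1000)
    # Group elements into batches for training
    ds = ds.batch(batch_size)
    # Overlap data preparation with model execution
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds
```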

Conclusion

tf.data provides powerful tools for building efficient input pipelines with fine-grained control. It enables custom data preprocessing, optimal performance through parallel loading, and seamless integration with TensorFlow training workflows.

Updated on: 2026-03-25T16:01:04+05:30
