How can Tensorflow be used with tf.data for finer control using Python?
The tf.data API in TensorFlow provides finer control over data preprocessing pipelines. It helps create efficient input pipelines by shuffling datasets, splitting data, and optimizing data loading for training neural networks.
We will demonstrate using the flowers dataset, which contains thousands of flower images organized in 5 subdirectories (one per class). This example shows how to create a customized input pipeline with proper train-validation splitting.
We are using Google Colaboratory to run the code. Google Colab provides free access to GPUs and requires zero configuration for running TensorFlow code.
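The flowers dataset itself can be fetched with tf.keras.utils.get_file, as in the TensorFlow tutorial. If you just want to try the pipeline offline, a hedged sketch like the following mimics the same directory layout (one subdirectory per class) with placeholder files — the file names and empty contents here are stand-ins, not real images:

```python
import tempfile
from pathlib import Path

# Build a tiny synthetic tree that mirrors flower_photos/<class>/<image>.jpg
# so data_dir and image_count can be defined without downloading anything.
data_dir = Path(tempfile.mkdtemp()) / "flower_photos"
for cls in ["daisy", "dandelion", "roses", "sunflowers", "tulips"]:
    (data_dir / cls).mkdir(parents=True)
    for i in range(4):  # a few dummy files per class
        (data_dir / cls / f"img_{i}.jpg").write_bytes(b"")

image_count = len(list(data_dir.glob("*/*.jpg")))
print(image_count)  # 5 classes * 4 files = 20
```

With data_dir and image_count defined this way, the pipeline code below runs unchanged (though the dummy files cannot be decoded as real JPEGs).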
Creating a Custom Input Pipeline
First, we create a dataset from file paths and shuffle them for better training distribution −
import tensorflow as tf
import numpy as np
from pathlib import Path
# Assume data_dir and image_count are already defined
# data_dir = Path('/root/.keras/datasets/flower_photos')
# image_count = len(list(data_dir.glob('*/*.jpg')))
print("Defining customized input pipeline")
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)
for f in list_ds.take(5):
   print(f.numpy())
class_names = np.array(sorted([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"]))
print(class_names)
print("The dataset is split into training and validation set")
val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
print("Length of each subset is displayed below")
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())
Code credit: https://www.tensorflow.org/tutorials/load_data/images
Output
Defining customized input pipeline
b'/root/.keras/datasets/flower_photos/dandelion/14306875733_61d71c64c0_n.jpg'
b'/root/.keras/datasets/flower_photos/dandelion/8935477500_89f22cca03_n.jpg'
b'/root/.keras/datasets/flower_photos/sunflowers/3001531316_efae24d37d_n.jpg'
b'/root/.keras/datasets/flower_photos/daisy/7133935763_82b17c8e1b_n.jpg'
b'/root/.keras/datasets/flower_photos/tulips/17844723633_da85357fe3.jpg'
['daisy' 'dandelion' 'roses' 'sunflowers' 'tulips']
The dataset is split into training and validation set
Length of each subset is displayed below
2936
734
How It Works
The pipeline works in several steps −
- File Listing: tf.data.Dataset.list_files() creates a dataset of file paths from the directory structure
- Shuffling: shuffle() randomizes the file order for better training distribution
- Class Extraction: directory names become the class labels automatically
- Train-Val Split: skip() and take() split the data into training (80%) and validation (20%) sets
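The class-extraction step can be made explicit by mapping each file path to an integer label. The get_label helper below is a sketch following the same tutorial's approach — the path and class list are the ones from this example, but the helper name is ours:

```python
import os
import numpy as np
import tensorflow as tf

class_names = np.array(['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'])

def get_label(file_path):
    # Split the path into components; the second-to-last part
    # is the class subdirectory name.
    parts = tf.strings.split(file_path, os.path.sep)
    one_hot = parts[-2] == class_names  # boolean match against class names
    return tf.argmax(one_hot)           # index of the matching class

label = get_label(tf.constant('/root/.keras/datasets/flower_photos/roses/x.jpg'))
print(int(label))  # 2, the index of 'roses' in the sorted class list
```

Mapping this function over list_ds with dataset.map(get_label) yields a dataset of labels aligned with the file paths.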
Key Benefits
Using tf.data provides several advantages over keras.preprocessing −
- Fine Control: Custom preprocessing, augmentation, and batching strategies
- Performance: Optimized data loading with prefetching and parallel processing
- Flexibility: Easy integration with complex data pipelines
- Memory Efficiency: Lazy loading prevents memory overflow with large datasets
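These benefits correspond to concrete tf.data calls. A minimal sketch on a toy range dataset — the map function, buffer sizes, and batch size here are placeholders, not tuned values:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data pick buffer sizes dynamically

ds = tf.data.Dataset.range(10)
ds = ds.map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
ds = ds.cache()             # keep results in memory after the first pass
ds = ds.shuffle(10)         # randomize element order each epoch
ds = ds.batch(4)            # group elements into batches
ds = ds.prefetch(AUTOTUNE)  # overlap data loading with training

for batch in ds:
    print(batch.numpy())
```

The same cache/shuffle/batch/prefetch chain is what the TensorFlow tutorial applies to the flowers pipeline before training.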
Conclusion
tf.data provides powerful tools for building efficient input pipelines with fine-grained control. It enables custom data preprocessing, optimal performance through parallel loading, and seamless integration with TensorFlow training workflows.
