How can the preprocessed data be shuffled using TensorFlow and Python?

TensorFlow is an open-source machine learning framework provided by Google. It is used with Python to implement algorithms, deep learning applications, and much more. It interoperates with NumPy and represents data as multi-dimensional arrays called tensors, which allow efficient mathematical operations.

The tensorflow package can be installed on Windows using the following command:

pip install tensorflow

In this tutorial, we'll demonstrate how to shuffle preprocessed data using TensorFlow's dataset operations. We'll use the Iliad dataset containing text translations from William Cowper, Edward (Earl of Derby), and Samuel Butler.

Dataset Preparation

The text files have been preprocessed to remove document headers, footers, line numbers, and chapter titles. The model will be trained to identify the translator from a single line of text.
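The shuffling code below expects a list named labeled_data_sets, with one dataset of (text, label) pairs per translator. The following is a minimal sketch of how such a list might be built. It assumes in-memory placeholder lines instead of the real text files, and the names sample_lines and labeler are illustrative rather than taken from this article.

```python
import tensorflow as tf

# Placeholder lines standing in for the three preprocessed text files
# (assumption: the real data holds one line of verse or prose per element).
sample_lines = [
    ["placeholder line A from translator 0", "placeholder line B from translator 0"],
    ["placeholder line from translator 1"],
    ["placeholder line from translator 2"],
]

def labeler(example, index):
    # Pair each line of text with an integer label identifying its translator.
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []
for i, lines in enumerate(sample_lines):
    lines_dataset = tf.data.Dataset.from_tensor_slices(lines)
    # map() traces the lambda immediately, so the current value of i is used.
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

for text, label in labeled_data_sets[0].take(1):
    print(text.numpy(), label.numpy())
```

Each dataset keeps its own label, so the translator can still be recovered after the datasets are concatenated and shuffled.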

Shuffling Preprocessed Data

Here's how to combine labeled datasets and shuffle them using TensorFlow:

print("Combine the labelled dataset and reshuffle it")
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

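# Combine all labeled datasets
# (labeled_data_sets is the list of per-translator datasets of
# (text, label) pairs created during preprocessing)
all_labeled_data = labeled_data_sets[0]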
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# Shuffle the combined dataset
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

print("Displaying a few samples of input data")
for text, label in all_labeled_data.take(8):
    # print() already separates its arguments with a single space
    print("The sentence is :", text.numpy())
    print("The label is :", label.numpy())

The output of the above code is:

Combine the labelled dataset and reshuffle it
Displaying a few samples of input data
The sentence is : b'But I have now both tasted food, and given'
The label is : 0
The sentence is : b'All these shall now be thine: but if the Gods'
The label is : 1
The sentence is : b'Their spiry summits waved. There, unperceived'
The label is : 0
The sentence is : b'"I pray you, would you show your love, dear friends,'
The label is : 1
The sentence is : b'Entering beneath the clavicle the point'
The label is : 0
The sentence is : b'But grief, his father lost, awaits him now,'
The label is : 1
The sentence is : b'in the fore-arm where the sinews of the elbow are united, whereon he'
The label is : 2
The sentence is : b'For, as I think, I have already chased'
The label is : 0

Key Parameters

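BUFFER_SIZE (50000): Size of the shuffle buffer. Larger values provide better randomization but use more memory. A buffer at least as large as the dataset gives a full uniform shuffle.
reshuffle_each_iteration=False: Keeps the same shuffled order every time the dataset is iterated. This makes results reproducible across epochs and keeps any later split made with take() and skip() consistent.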
Labels: 0, 1, 2 represent the three different translators (Cowper, Derby, Butler).
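The effect of reshuffle_each_iteration=False can be checked on a small dataset. The sketch below assumes a toy dataset of ten integers, a fixed seed, and a VALIDATION_SIZE of 3 chosen only for illustration; none of these values come from the Iliad example.

```python
import tensorflow as tf

# Toy dataset of the integers 0..9 (assumption for illustration).
dataset = tf.data.Dataset.range(10)

# The buffer covers the whole dataset, so the shuffle is a full shuffle.
# The seed is an extra assumption that makes the run repeatable.
shuffled = dataset.shuffle(
    buffer_size=10, seed=42, reshuffle_each_iteration=False)

# Iterating twice yields the same order because reshuffling is disabled.
first_pass = [int(x) for x in shuffled.as_numpy_iterator()]
second_pass = [int(x) for x in shuffled.as_numpy_iterator()]
print("Same order on both passes:", first_pass == second_pass)

# Because the order is stable, take() and skip() see the same sequence,
# so the two splits do not share any elements.
VALIDATION_SIZE = 3
validation_data = shuffled.take(VALIDATION_SIZE)
train_data = shuffled.skip(VALIDATION_SIZE)

validation_items = {int(x) for x in validation_data.as_numpy_iterator()}
train_items = {int(x) for x in train_data.as_numpy_iterator()}
print("Overlap between splits:", validation_items & train_items)
```

With reshuffling left enabled, each of these iterations could see a different order, and the two splits could then share elements.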

How Shuffling Works

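The shuffle() method fills a buffer with the first buffer_size elements of the dataset, then repeatedly picks a random element from that buffer and replaces it with the next element from the input. Elements are therefore only randomized relative to others that share the buffer, which is why a larger buffer gives a more thorough shuffle. Shuffling prevents the model from learning patterns based on data order, improving training effectiveness.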
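The buffer mechanism can be illustrated in plain Python. The function below is a simplified model of the idea, not TensorFlow's actual implementation, and the name buffered_shuffle is hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

small_buffer = list(buffered_shuffle(range(20), buffer_size=3, seed=0))
full_buffer = list(buffered_shuffle(range(20), buffer_size=20, seed=0))
print("buffer_size=3 :", small_buffer)
print("buffer_size=20:", full_buffer)
```

With a buffer of 3, the element emitted at position k can come from no further than input position k + 2, so the output stays close to the original order. With a buffer that covers the whole input, every ordering is possible. A buffer of size 1 performs no shuffling at all.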

Conclusion

Use TensorFlow's shuffle() method to randomize dataset order for better model training. The buffer size controls randomization quality while concatenate() combines multiple datasets before shuffling.

AmitDiwan

Updated on: 2026-03-25T15:25:39+05:30
