How can TensorFlow be used to download and explore the IMDB dataset in Python?

TensorFlow is a machine learning framework that is provided by Google. It is an open-source framework used in conjunction with Python to implement algorithms, deep learning applications and much more. It is used in research and for production purposes.

TensorFlow works with multi-dimensional arrays, known as tensors, and interoperates with NumPy. The framework supports deep neural networks, is highly scalable, and comes with many popular datasets. It can use GPU computation and automates the management of resources. It also comes with a multitude of machine learning libraries and is well supported and documented.

The IMDB dataset contains 50,000 labeled movie reviews, split evenly into 25,000 for training and 25,000 for testing. It is commonly used in Natural Language Processing for sentiment analysis tasks.

Installing TensorFlow

The tensorflow package can be installed on Windows (or any other platform where pip is available) using the following command:

pip install tensorflow

Downloading and Loading the IMDB Dataset

We can download the IMDB dataset directly using TensorFlow's utilities. The following example shows how to download, extract, and explore the dataset:

import os
import shutil

import tensorflow as tf

print("The tensorflow version is")
print(tf.__version__)

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download the archive into the current directory and extract it (untar=True)
print("The dataset is being downloaded")
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                 untar=True, cache_dir='.',
                                 cache_subdir='')

# The archive unpacks into an 'aclImdb' folder next to the download
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
print("The directories in the downloaded folder are:")
print(os.listdir(dataset_dir))

The tensorflow version is
2.15.0
The dataset is being downloaded
The directories in the downloaded folder are:
['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']
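
To make the extraction step less of a black box, the sketch below unpacks a .tar.gz archive using only Python's standard library. It is an illustration of the general idea behind untar=True, not Keras's actual implementation. The helper name extract_tar_gz and the tiny stand-in archive are assumptions made for this example; the real dataset is not downloaded here.

```python
import os
import tarfile
import tempfile


def extract_tar_gz(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level entries.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))


# Build a tiny stand-in archive that mimics the 'aclImdb' layout,
# then extract it again. The file content is made up.
with tempfile.TemporaryDirectory() as tmp:
    pos_dir = os.path.join(tmp, "src", "aclImdb", "train", "pos")
    os.makedirs(pos_dir)
    with open(os.path.join(pos_dir, "0_9.txt"), "w", encoding="utf-8") as f:
        f.write("A stand-in review.")

    archive_path = os.path.join(tmp, "mini.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(os.path.join(tmp, "src", "aclImdb"), arcname="aclImdb")

    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    top_level = extract_tar_gz(archive_path, out_dir)
    print(top_level)
```

Only extract archives from sources you trust, since a tar file can contain unexpected paths.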

Exploring the Dataset Structure

Let's examine the training directory and look at a sample review:

train_dir = os.path.join(dataset_dir, 'train')
print("Contents of train directory:")
print(os.listdir(train_dir))

print("\nThe sample of data:")
sample_file = os.path.join(train_dir, 'pos', '1181_9.txt')
# Read the review as UTF-8 so the result does not depend on the platform default
with open(sample_file, encoding='utf-8') as f:
    print(f.read())

# Remove the unlabeled 'unsup' directory; otherwise
# text_dataset_from_directory would treat it as a third class
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Contents of train directory:
['neg', 'pos', 'unsup']

The sample of data:
Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.
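
The name of the sample file, 1181_9.txt, carries information of its own. According to the dataset's README, review files are named [id]_[rating].txt, where the rating is the reviewer's score on a 1 to 10 scale. The sketch below splits such a name into its two parts; the helper name parse_review_filename is an assumption made for this example.

```python
import os


def parse_review_filename(path):
    """Split '<id>_<rating>.txt' into an integer id and an integer rating.

    Hypothetical helper for illustration only.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)


review_id, rating = parse_review_filename("pos/1181_9.txt")
print("Review id:", review_id)
print("Rating:", rating)
```

For the sample above this gives id 1181 and rating 9, which fits its place in the pos directory.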

Creating the Training Dataset

We'll create the training dataset using TensorFlow's text_dataset_from_directory function, holding out 20% of the files in the train directory for validation:

batch_size = 32
seed = 42
print("The batch size is:", batch_size)

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

# Display sample reviews and labels
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print("Review", text_batch.numpy()[i][:200], "...")
        print("Label", label_batch.numpy()[i])
        print()

print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

The batch size is: 32
Found 25000 files belonging to 2 classes.
Using 20000 files for training.

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Sp' ...
Label 0

Review b'David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into co' ...
Label 0

Review b'Great documentary about the lives of NY firefighters during the worst terrorist attack of all time.. That reason alone is why this should be a must see collectors item.. What shocked me was not only the attacks, but the"High' ...
Label 1

Label 0 corresponds to neg
Label 1 corresponds to pos
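
The label numbers come from the directory names. When class names are not passed explicitly, text_dataset_from_directory orders the class subdirectories alphanumerically and numbers them from 0. The sketch below reproduces that ordering in plain Python; the helper name infer_label_indices is an assumption made for this example and is not part of TensorFlow.

```python
def infer_label_indices(class_dirs):
    """Map each class directory name to its index in sorted order.

    Hypothetical helper for illustration only.
    """
    return {name: index for index, name in enumerate(sorted(class_dirs))}


labels = infer_label_indices(["pos", "neg"])
print(labels)

# Had 'unsup' not been removed, it would have become a third class.
labels_with_unsup = infer_label_indices(["pos", "neg", "unsup"])
print(labels_with_unsup)
```

This is also why the unsup directory was deleted earlier: left in place, it would show up as label 2.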

Creating Validation and Test Datasets

Complete the dataset preparation by creating validation and test sets:

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

print("Dataset preparation complete!")
print("Training samples: ~20,000")
print("Validation samples: ~5,000")
print("Test samples: ~25,000")

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Dataset preparation complete!
Training samples: ~20,000
Validation samples: ~5,000
Test samples: ~25,000
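
The sample counts above follow from simple arithmetic, and the same arithmetic gives the number of batches in each dataset. The sketch below works it out in plain Python. The helper names are assumptions made for this example, and rounding the validation share down with int() is an assumption about how the split is computed.

```python
import math


def split_sizes(total, validation_split):
    """Return (training, validation) sizes for a fractional split.

    Hypothetical helper; assumes the validation share is rounded down.
    """
    validation = int(total * validation_split)
    return total - validation, validation


def num_batches(samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(samples / batch_size)


train_size, val_size = split_sizes(25000, 0.2)
print("Training samples:", train_size)
print("Validation samples:", val_size)

batch_size = 32
print("Training batches:", num_batches(train_size, batch_size))
print("Validation batches:", num_batches(val_size, batch_size))
print("Test batches:", num_batches(25000, batch_size))
```

With a batch size of 32, this works out to 625 training batches, 157 validation batches, and 782 test batches.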

Key Points

The IMDB dataset contains movie reviews labeled as positive (1) or negative (0)
TensorFlow's get_file() function downloads the dataset and, with untar=True, extracts it
The 25,000 training reviews are split into training (80%) and validation (20%) sets, while the 25,000 test reviews are kept separate
Each review is stored as a separate text file in pos/ or neg/ directories
The text_dataset_from_directory() function creates TensorFlow datasets ready for training
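
Because each review is a separate text file inside a class directory, the layout can be checked without TensorFlow. The sketch below counts the .txt files per class directory. It runs on a small stand-in directory created on the fly, so the counts shown are not those of the real dataset; the helper name count_reviews is an assumption made for this example.

```python
import os
import tempfile


def count_reviews(split_dir):
    """Count .txt files in each class subdirectory of split_dir.

    Hypothetical helper for illustration only.
    """
    counts = {}
    for name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir) if f.endswith(".txt")
            )
    return counts


# Stand-in directory with made-up files: 2 negative and 3 positive reviews.
with tempfile.TemporaryDirectory() as tmp:
    for label, n_files in (("neg", 2), ("pos", 3)):
        os.makedirs(os.path.join(tmp, label))
        for i in range(n_files):
            file_path = os.path.join(tmp, label, f"{i}_5.txt")
            with open(file_path, "w", encoding="utf-8") as f:
                f.write("stand-in review text")
    demo_counts = count_reviews(tmp)
    print(demo_counts)
```

Pointed at the real aclImdb/train directory, the same function should report 12,500 files each for neg and pos, since the labeled training set is balanced.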

Conclusion

TensorFlow provides convenient utilities to download and prepare the IMDB dataset for sentiment analysis tasks. The dataset can be easily loaded, split into training/validation sets, and converted into TensorFlow datasets ready for model training.

---

AmitDiwan

Updated on: 2026-03-25T15:32:48+05:30
