How can Keras be used to download and explore the Stack Overflow question-tagging dataset in Python?

TensorFlow is a machine learning framework provided by Google. It is an open-source framework used with Python to implement algorithms, deep learning applications, and much more. Keras, a high-level deep learning API within TensorFlow, provides easy access to datasets and preprocessing tools.

The Stack Overflow dataset contains question titles and their corresponding tags, making it perfect for multi-class text classification tasks. We can use Keras utilities to download and explore this dataset efficiently.

Installation

First, install the required packages −

pip install tensorflow
pip install tensorflow-text

Downloading the Stack Overflow Dataset

Keras provides the utils.get_file() function to download datasets directly from URLs −

import pathlib
import tensorflow as tf
from tensorflow.keras import utils

print("Downloading Stack Overflow dataset...")

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir=''
)

dataset_dir = pathlib.Path(dataset).parent
print(f"Dataset downloaded to: {dataset_dir}")
Downloading Stack Overflow dataset...
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 2s 0us/step
Dataset downloaded to: stack_overflow

Exploring the Dataset Structure

Let's examine the downloaded dataset structure. Because the archive extracts its train and test folders directly into the cache directory, we point at 'stack_overflow' and use rglob() so the count descends into the tag subdirectories −

import pathlib

dataset_dir = pathlib.Path('stack_overflow')

# List the train and test directories
print("Dataset structure:")
for item in dataset_dir.iterdir():
    if item.is_dir():
        print(f"Directory: {item.name}")
        # Count the .txt files nested under each tag subdirectory
        file_count = len(list(item.rglob('*.txt')))
        print(f"  Files: {file_count}")
Dataset structure:
Directory: train
  Files: 8000
Directory: test
  Files: 8000
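The same directory walk can be tried without downloading anything by mocking the layout. Below is a minimal sketch using a temporary directory; the tag names mirror the real dataset, but the three-files-per-tag count is made up for illustration:

```python
import pathlib
import tempfile

# Build a tiny stand-in for the extracted dataset layout
root = pathlib.Path(tempfile.mkdtemp())
for split in ['train', 'test']:
    for tag in ['csharp', 'java', 'javascript', 'python']:
        tag_dir = root / split / tag
        tag_dir.mkdir(parents=True)
        for i in range(3):  # 3 placeholder questions per tag
            (tag_dir / f'{i}.txt').write_text('sample question text')

# Same walk as above: rglob descends into the tag subdirectories
for item in sorted(root.iterdir()):
    if item.is_dir():
        print(f"Directory: {item.name}, Files: {len(list(item.rglob('*.txt')))}")
```

Note that a plain glob('*.txt') on the split directory would report zero files, since the text files live one level deeper inside the tag folders.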

Loading and Examining Sample Data

The dataset contains question titles organized by programming language tags −

import pathlib

dataset_dir = pathlib.Path('stack_overflow')
train_dir = dataset_dir / 'train'

# List available tags (subdirectories)
tags = [item.name for item in train_dir.iterdir() if item.is_dir()]
print(f"Available tags: {sorted(tags)}")

# Read a sample question from each tag
print("\nSample questions:")
for tag in sorted(tags)[:3]:  # Show first 3 tags
    tag_dir = train_dir / tag
    sample_file = sorted(tag_dir.glob('*.txt'))[0]  # sort for a deterministic pick
    
    with open(sample_file, 'r', encoding='utf-8') as f:
        question = f.read().strip()
    
    print(f"\nTag: {tag}")
    print(f"Question: {question[:100]}...")
Available tags: ['csharp', 'java', 'javascript', 'python']

Sample questions:

Tag: csharp
Question: How to add a reference to a type in another assembly/namespace in C#?...

Tag: java
Question: How can I convert a stack trace to a string?...

Tag: javascript
Question: How to check if a string contains a substring in JavaScript?...

Dataset Statistics

Let's analyze the distribution of questions across different tags −

import pathlib

dataset_dir = pathlib.Path('stack_overflow')

for split in ['train', 'test']:
    split_dir = dataset_dir / split
    print(f"\n{split.upper()} set statistics:")
    
    total_questions = 0
    for tag_dir in split_dir.iterdir():
        if tag_dir.is_dir():
            count = len(list(tag_dir.glob('*.txt')))
            total_questions += count
            print(f"  {tag_dir.name}: {count} questions")
    
    print(f"  Total: {total_questions} questions")
TRAIN set statistics:
  csharp: 2000 questions
  java: 2000 questions  
  javascript: 2000 questions
  python: 2000 questions
  Total: 8000 questions

TEST set statistics:
  csharp: 2000 questions
  java: 2000 questions
  javascript: 2000 questions  
  python: 2000 questions
  Total: 8000 questions
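The per-tag counting above can be condensed with collections.Counter, since each file's tag is simply its parent directory's name. A sketch over a hypothetical list of paths (stand-ins for the real *.txt files):

```python
from collections import Counter
import pathlib

# Hypothetical file paths shaped like the real layout: <split>/<tag>/<id>.txt
paths = [
    pathlib.Path('train/python/0.txt'),
    pathlib.Path('train/python/1.txt'),
    pathlib.Path('train/java/0.txt'),
    pathlib.Path('train/csharp/0.txt'),
]

# The tag is the parent directory's name
counts = Counter(p.parent.name for p in paths)
for tag, count in sorted(counts.items()):
    print(f"  {tag}: {count} questions")
```

On the real dataset you would feed Counter with `split_dir.rglob('*.txt')` instead of the hand-written list.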

Creating TensorFlow Dataset

Convert the downloaded files into a TensorFlow dataset for training −

import tensorflow as tf

dataset_dir = 'stack_overflow'
batch_size = 32

# Create training dataset
train_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir + '/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=42
)

# Create validation dataset  
val_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir + '/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation', 
    seed=42
)

print(f"Training batches: {tf.data.experimental.cardinality(train_ds).numpy()}")
print(f"Validation batches: {tf.data.experimental.cardinality(val_ds).numpy()}")
print(f"Class names: {train_ds.class_names}")
Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.  
Using 1600 files for validation.
Training batches: 200
Validation batches: 50
Class names: ['csharp', 'java', 'javascript', 'python']
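The batch counts reported above follow directly from the split arithmetic. A quick sketch, pure Python with no TensorFlow needed:

```python
import math

total_files = 8000   # files found under train/
val_fraction = 0.2   # validation_split=0.2
batch_size = 32

n_val = int(total_files * val_fraction)  # 1600 validation files
n_train = total_files - n_val            # 6400 training files

# The last partial batch, if any, still counts as a batch, hence the ceiling
print(math.ceil(n_train / batch_size))  # 200 training batches
print(math.ceil(n_val / batch_size))    # 50 validation batches
```

Here both splits divide evenly by 32, so there is no partial final batch.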

Conclusion

Keras makes it simple to download and explore datasets using utils.get_file() for downloading and text_dataset_from_directory() for creating TensorFlow datasets. The Stack Overflow dataset provides 16,000 balanced examples across four programming languages, perfect for text classification experiments.

Updated on: 2026-03-25T14:54:46+05:30
