How to use a DataLoader in PyTorch?


PyTorch is a popular open-source machine learning library that data scientists, researchers, and developers use widely to build AI/ML products. One of its most important features is the DataLoader class, which loads and batches data efficiently for neural network training. This article shows how to use the DataLoader in PyTorch.

Using the DataLoader in PyTorch

We can follow these basic steps to load data with PyTorch's DataLoader −

  • Data Preparation − Create a custom RandomDataset class that generates a random dataset of the desired size. Use a DataLoader to create batches of data, specifying the batch size and enabling shuffling.

  • Neural Network Definition − Define a neural network class, Net, with two fully connected layers and an activation function. Customize the architecture based on the desired number of units in each layer.

  • Initialization and Optimization − Instantiate the Net class, set the mean squared error (MSE) loss criterion, and initialize the optimizer as stochastic gradient descent (SGD) with the desired learning rate.

  • Training Loop − Iterate over the DataLoader for the desired number of epochs. For each batch of data, compute the network output, calculate the loss, backpropagate the gradients, update the weights, and track the running loss.

Example

The following code defines a simple neural network and a random dataset of 1000 data points with ten features each. It then creates a DataLoader from the dataset with a batch size of 32 and shuffling enabled. The network is trained with stochastic gradient descent using a mean squared error loss. The training loop iterates over the DataLoader for ten epochs, computing the loss for each batch, backpropagating the gradients, and updating the network weights. The running loss is printed every ten batches to monitor training progress.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Dataset of random samples, each with 10 features
class RandomDataset(Dataset):
    def __init__(self, size):
        self.data = torch.randn(size, 10)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

# Simple two-layer fully connected network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x

dataset = RandomDataset(1000)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

net = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(dataloader, 0):
        inputs = data
        # random targets purely for demonstration; they are not tied to the inputs
        labels = torch.rand((data.shape[0], 1))
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # print the average loss every 10 batches
        if i % 10 == 9:
            print(f"[Epoch {epoch + 1}, Batch {i + 1}] loss: {running_loss / 10}")
            running_loss = 0.0

Output

[Epoch 1, Batch 10] loss: 0.25439725518226625
[Epoch 1, Batch 20] loss: 0.18304144889116286
[Epoch 1, Batch 30] loss: 0.1451663628220558
[Epoch 2, Batch 10] loss: 0.12896266356110572
[Epoch 2, Batch 20] loss: 0.11783223450183869
..................................................
[Epoch 10, Batch 30] loss: 0.09491728842258454

Data Sampling and Weighted Sampling

Data sampling refers to selecting only a subset of the data for processing. This is essential in machine learning and data analysis when the full dataset cannot fit into RAM; sampling lets us train, test, and validate batch-wise. Weighted sampling is a variant in which we assign a weight to each data point, so that points with more impact on the prediction are drawn more often.

Syntax

weighted_sampler = WeightedRandomSampler(weights, num_samples=len(dataset), other parameters...)
loader = DataLoader(dataset, batch_size=batch_size, sampler=weighted_sampler, other parameters...)

Here we define the weights as a list, tensor, or other array-like object of per-sample weights, and WeightedRandomSampler builds the sampler from them. We then pass both the dataset and the sampler to the DataLoader, using the "sampler" parameter to enable weighted sampling.
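
A common way to construct the weight array, shown here as a brief sketch (the labels tensor is a hypothetical stand-in and not part of the example below), is to weight every sample by the inverse frequency of its class so that rarer classes are drawn more often −

import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical labels for illustration: 1000 samples across 10 classes
labels = torch.randint(0, 10, (1000,))

# count the samples per class, then weight each sample by the inverse of its class frequency
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)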

Example

In the following example, we implement weighted sampling using DataLoader and WeightedRandomSampler. We pass the dataset and batch_size=32 to the DataLoader object, so 32 samples are processed at a time. WeightedRandomSampler assigns a weight to each sample; here samples with label 0 are given twice the weight of the others. Since we set replacement=True, a data point can appear in more than one batch.

import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
class CustomDataset(Dataset):
    def __init__(self):
        self.data = torch.randn((1000, 3, 32, 32))
        self.labels = torch.randint(0, 10, (1000,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
dataset = CustomDataset()
# give samples with label 0 twice the weight of the others
weights = torch.where(dataset.labels == 0, torch.tensor(2.0), torch.tensor(1.0))
sampler = WeightedRandomSampler(weights, len(dataset), replacement=True)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
for batch_data, batch_labels in dataloader:
    print(batch_data.shape, batch_labels.shape) 

Output

torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
..................................................
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([8, 3, 32, 32]) torch.Size([8])

Multi-threaded Data Loading

Multi-threaded loading speeds up data loading and pre-processing by preparing batches in parallel with model execution. PyTorch implements this with worker subprocesses: the num_workers parameter of the DataLoader takes the number of workers as an integer, and each worker loads batches in the background.

Syntax

dataloader = DataLoader( num_workers=<number of workers>, other parameters)

Here num_workers is the number of subprocesses used for data loading. A common choice is to set it to the number of CPU cores available. It must be a non-negative integer; num_workers=0 (the default) means the data is loaded in the main process.
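
As a quick sketch of how one might pick this value (the TensorDataset here is only a placeholder and not part of the example below), we can query the number of CPU cores at runtime −

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset purely for illustration
dataset = TensorDataset(torch.randn(100, 10))

# use as many worker processes as there are CPU cores; fall back to 0 (load in the main process)
num_workers = os.cpu_count() or 0
dataloader = DataLoader(dataset, batch_size=32, num_workers=num_workers)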

Example

In the following code, we set num_workers to 2, so data loading and pre-processing happen in two worker processes in parallel. We keep the batch size at 32 and set shuffle=True (shuffling happens at the start of each epoch, before the batches are created).

import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 64, 64))
        self.labels = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
dataset = CustomDataset(num_samples=3000)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)
    print("Batch labels shape:", batch_labels.shape)

Output

Batch data shape: torch.Size([32, 3, 64, 64])
Batch labels shape: torch.Size([32])
Batch data shape: torch.Size([32, 3, 64, 64])
.......................................................
Batch data shape: torch.Size([24, 3, 64, 64])
Batch labels shape: torch.Size([24])
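
One practical note when using num_workers > 0: on platforms that start worker processes with the spawn method (such as Windows and macOS), the code that creates the DataLoader should run under an if __name__ == "__main__" guard so the workers can import the script safely. A minimal sketch, assuming the CustomDataset class and imports from the example above −

# assumes the CustomDataset class and DataLoader import defined in the example above
if __name__ == "__main__":
    dataset = CustomDataset(num_samples=3000)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
    for batch_data, batch_labels in dataloader:
        print("Batch data shape:", batch_data.shape)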

Shuffling and Batch Size

As the name suggests, shuffling refers to randomly reordering the data points before batching. This removes ordering bias: each batch becomes a more uniform sample of the data, which usually leads to better training. Batch size, on the other hand, refers to how many data points are grouped and processed at once; this matters because the full dataset may not fit in memory.

Syntax

dataloader = DataLoader(dataset, batch_size=<set a number>,
shuffle=<Boolean True or False>, other parameters...)

Here dataset is the data whose batching and shuffling we want to control. batch_size takes a positive integer. shuffle accepts the booleans True and False: if set to True the data is reshuffled at every epoch, and if set to False no shuffling takes place.
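
When the dataset length is not a multiple of the batch size, the last batch comes out smaller than batch_size. If that partial batch is unwanted, DataLoader also accepts drop_last=True to discard it; a short sketch (the TensorDataset is only a stand-in, not part of the example below) −

import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 samples with batch_size=128 yield 7 full batches plus a final batch of 104
dataset = TensorDataset(torch.randn(1000, 10))

loader = DataLoader(dataset, batch_size=128, shuffle=True)          # keeps the final 104-sample batch
loader_full = DataLoader(dataset, batch_size=128, drop_last=True)   # drops the incomplete batch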

Example

In the following example, we pass two important parameters to the DataLoader, batch_size and shuffle. We set batch_size to 128, meaning that 128 data points are processed at a time. shuffle=True means the data is reshuffled before each epoch; if set to False no shuffling occurs and we may end up with a slightly biased model.

import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 32, 32))
        self.labels = torch.randint(0, 10, (num_samples,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
dataset = CustomDataset(num_samples=1000)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)
    print("Batch labels shape:", batch_labels.shape)

Output

Batch data shape: torch.Size([128, 3, 32, 32])
Batch labels shape: torch.Size([128])
......................................................
Batch data shape: torch.Size([104, 3, 32, 32])
Batch labels shape: torch.Size([104])

Conclusion

In this article, we discussed how to use the DataLoader in PyTorch to prepare data for training neural networks. These classes are extremely useful when training on existing models, saving time and giving good results thanks to the contributions of many developers and the open-source community. It is also important to understand that different models may require different hyperparameters, so the parameters you choose depend on the resources available and the characteristics of the data.
