Loading Data in PyTorch


Every machine learning project depends on data, and PyTorch, the well-known open-source machine learning framework created by Facebook, is no exception. This guide aims to streamline the process of loading data into PyTorch and get you up and running as quickly as possible.

This article focuses on PyTorch's Dataset, DataLoader, and transform classes. We'll go through some practical examples to help you understand these core PyTorch ideas and streamline your machine learning applications.

PyTorch Data Loading: A Brief Overview

PyTorch offers a powerful and flexible toolkit for loading and preparing data. The three key components are −

  • Dataset − An abstract class that represents a dataset and allows data in any format to be loaded. Only the two methods __getitem__() and __len__() need to be overridden.

  • DataLoader − This wraps a Dataset and provides quick access to the underlying data. It automatically builds batches, shuffles the data, and loads the data in parallel using multiple worker processes.

  • Transforms − These are common data transformations, typically applied to images. They can be chained together with transforms.Compose, which lets you build a pipeline of preprocessing operations to apply to the loaded data.

Loading Data into PyTorch: An Example

Consider an image collection where each image is represented as a 3D NumPy array and the labels are kept separately from the images. Here is a quick way to load this data into PyTorch.

from torch.utils.data import Dataset, DataLoader
import numpy as np

class ImageDataset(Dataset):
   def __init__(self, images, labels):
      self.images = images
      self.labels = labels

   def __getitem__(self, index):
      # Return the (image, label) pair at the given index
      return self.images[index], self.labels[index]

   def __len__(self):
      # Total number of samples in the dataset
      return len(self.labels)

# Let's assume we have image data in NumPy arrays
images = np.random.rand(10000, 3, 32, 32)      # 10,000 images of shape 3 x 32 x 32 (C x H x W)
labels = np.random.randint(0, 10, 10000)       # 10,000 integer class labels in the range [0, 10)

dataset = ImageDataset(images, labels)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)

In the code above, we defined a custom Dataset class. The __getitem__ method returns the image and label at the given index, while the __len__ method returns the total number of images. We then wrap this Dataset in a DataLoader, which handles batching and shuffling of the data.
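
With the DataLoader in place, iterating over mini-batches is straightforward. Here is a minimal sketch (the loop variable names are just illustrative):

# Iterate over the DataLoader to get shuffled mini-batches of 4 samples
for batch_images, batch_labels in dataloader:
   print(batch_images.shape)   # torch.Size([4, 3, 32, 32])
   print(batch_labels.shape)   # torch.Size([4])
   break   # stop after the first batch for demonstration purposes

The default collate function converts the NumPy arrays returned by __getitem__ into PyTorch tensors, so each batch is already a stacked tensor ready to feed into a model.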

Using Transforms with PyTorch

Transforms give you a flexible way to preprocess your data. In image-based tasks, for instance, we frequently need to convert the data to a tensor, normalise it, or apply data augmentation techniques. These tasks are simple with torchvision's transforms module.

from torchvision import transforms

# Define a transform to normalize the data
transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Apply the transform to all images in the dataset
class ImageDataset(Dataset):
   def __init__(self, images, labels, transform=None):
      self.images = images
      self.labels = labels
      self.transform = transform

   def __getitem__(self, index):
      image = self.images[index]
      if self.transform:
         # Apply the preprocessing pipeline to each image as it is fetched
         image = self.transform(image)
      return image, self.labels[index]

   def __len__(self):
      return len(self.labels)

# ToTensor expects NumPy images in H x W x C order, so generate the image data accordingly
images = np.random.rand(10000, 32, 32, 3).astype(np.float32)

dataset = ImageDataset(images, labels, transform=transform)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)

In this example, the transform first converts the image data to a PyTorch tensor and then normalises it. We pass this transform to our ImageDataset when we instantiate it, and it is applied to every image inside the __getitem__ method.
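
The transform pipeline is also the natural place for data augmentation. Below is a minimal sketch of a training-time pipeline with a random horizontal flip; it assumes a torchvision version whose transforms can operate on tensors (0.8 or later), which is why the flip is placed after ToTensor. The names train_transform and train_dataset are just illustrative.

# A sketch of a training-time pipeline that adds simple augmentation;
# RandomHorizontalFlip runs after ToTensor so it operates on tensors
train_transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.RandomHorizontalFlip(p=0.5),
   transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = ImageDataset(images, labels, transform=train_transform)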

Loading Data From CSV Files

For tasks such as regression and classification, data frequently needs to be loaded from CSV files. Let's use pandas to load a CSV file, preprocess the data, and build a PyTorch DataLoader.

import torch
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import TensorDataset, DataLoader

# Load the data from a CSV file
df = pd.read_csv('data.csv')

# Convert categorical data to numerical data
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split the data into inputs and targets
inputs = df.drop('category', axis=1).values
targets = df['category'].values

# Convert to PyTorch Dataset
dataset = TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))

# Wrap in a DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In this example, pandas loads the data from a CSV file, and Scikit-Learn's LabelEncoder converts the categorical column into numerical values. The data is then split into inputs and targets, converted into PyTorch tensors, and wrapped in a TensorDataset. Finally, we create a DataLoader to handle batching and shuffling.
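
One detail worth noting: NumPy arrays produced by pandas are typically float64, while most PyTorch models expect float32 inputs, and loss functions such as nn.CrossEntropyLoss expect int64 targets. It is therefore common to cast the tensors explicitly. A minimal sketch of that step, assuming purely numeric feature columns (the variable names are just illustrative):

# Cast features to float32 and targets to int64 before building the dataset
inputs_tensor = torch.from_numpy(inputs).float()
targets_tensor = torch.from_numpy(targets).long()

dataset = TensorDataset(inputs_tensor, targets_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)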

Conclusion

Data loading is an essential skill for building effective machine learning models in PyTorch. The Dataset, DataLoader, and transform classes make the job simpler and more efficient, and they can be adapted to your needs whether you are working with image data or tabular data.
