Module 7

Dataloading: datasets and dataloaders



Marc Lelarge

Deep Learning pipeline

Dataset and Dataloader + Model + Loss and Optimizer = Training

Overview of the course:

1- Course overview: machine learning pipeline

2- PyTorch tensors and automatic differentiation

3- Classification with deep learning

4- Convolutional neural networks

5- Embedding layers and dataloaders

Dataloading

Dataset class

torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit from Dataset and override the following methods:

  • __len__, so that len(dataset) returns the size of the dataset.
  • __getitem__, to support indexing, so that dataset[i] returns the i-th sample.
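
As a minimal sketch of the two methods above, here is a toy dataset whose i-th sample is the pair (i, i squared). The class name SquaresDataset and its contents are hypothetical and only serve the illustration.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Makes len(dataset) work.
        return self.n

    def __getitem__(self, i):
        # Makes dataset[i] work; IndexError lets a plain for loop terminate.
        if not 0 <= i < self.n:
            raise IndexError(i)
        x = torch.tensor(float(i))
        return x, x ** 2


dataset = SquaresDataset(5)
print(len(dataset))  # 5
print(dataset[3])    # (tensor(3.), tensor(9.))
```

Any object exposing these two methods can be indexed sample by sample, which is all a dataloader needs from a map-style dataset.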

Iterating through the dataset with Dataloader

By using a simple for loop to iterate over the data, we are missing out on:

  • Batching the data
  • Shuffling the data
  • Loading the data in parallel using multiprocessing workers.

torch.utils.data.DataLoader is an iterable that provides all these features.
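
The three features can be seen on a small example. This is a sketch on made-up data: the tensors features and labels and the names toy_dataset and toy_loader are assumptions for the illustration, and num_workers is kept at 0 so that everything runs in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (an assumption for illustration): 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float).reshape(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# batch_size groups samples, shuffle=True reorders them at every epoch,
# num_workers > 0 would load batches in worker processes (0 = main process).
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = []
seen_labels = []
for x, y in toy_loader:
    batch_shapes.append(tuple(x.shape))
    seen_labels.extend(y.tolist())

print(batch_shapes)  # the last batch is smaller, since 10 = 4 + 4 + 2
```

Every sample is visited exactly once per epoch, in a different order each time; passing drop_last=True would discard the incomplete final batch.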

Examples (1)

In the first lesson, we created two datasets, one for the training and one for the validation:

import os
from torchvision import transforms, datasets

# ImageNet channel means and standard deviations
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
imagenet_format = transforms.Compose([transforms.CenterCrop(224), transforms.ToTensor(), normalize])
# One ImageFolder dataset per split: data_dir/train and data_dir/valid
dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), imagenet_format) for x in ['train', 'valid']}

Hence len(dsets['train']) returns 23000, i.e. the number of images in the training set, or more precisely the number of image files found in the class subfolders of data_dir/train/.

Recall that data_dir/train/ (and similarly data_dir/valid/) is split into two folders, cats/ and dogs/. You can check that each of these folders has 11500 images by running ls | wc -l inside it.

You can recover the classes with dsets['train'].classes, which returns ['cats', 'dogs']. These are features of the torchvision.datasets.ImageFolder class.

More importantly, what does dsets['train'][0] return?

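Answer: a tuple containing a tensor (the transformed image, of shape 3×224×224 here) and a label (the integer index of its class in dsets['train'].classes).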

Examples (1)

To obtain a dataloader for the training set:

train_loader = torch.utils.data.DataLoader(dsets['train'], batch_size=64, shuffle=True, num_workers=6)

Then, you can use it as follows:

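for inputs, label in train_loader:
    output = model(inputs)          # forward pass on one batch
    loss = loss_fn(output, label)
    optimizer.zero_grad()           # clear gradients from the previous step
    loss.backward()                 # backpropagate
    optimizer.step()                # update the parameters
    ....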
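
To run this loop end to end without the image data, the sketch below plugs in hypothetical stand-ins: random two-class data, a linear model, a cross-entropy loss and plain SGD. None of these choices come from the course; they only make the loop executable.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins: random 2-class data and a linear model.
x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))
toy_train_loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for inputs, label in toy_train_loader:
        output = model(inputs)
        loss = loss_fn(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample average.
        running_loss += loss.item() * inputs.size(0)
    epoch_losses.append(running_loss / len(toy_train_loader.dataset))

print(epoch_losses)
```

The dataloader is re-iterated at each epoch, and with shuffle=True the batches are composed differently every time.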
Examples (2)

In the first lesson, we precomputed features and converted them to NumPy arrays. You first need to load the features and the labels:

conv_feat_train,labels_train = preconvfeat(loader_train)

and then, you can create a list for your dataset as follows:

dtype=torch.float
datasetfeat_train = [[torch.from_numpy(f).type(dtype),torch.tensor(l).type(torch.long)] for (f,l) in zip(conv_feat_train,labels_train)]
datasetfeat_train = [(inputs.reshape(-1), classes) for [inputs,classes] in datasetfeat_train]

A list supports the built-in function len() and has a __getitem__() method, hence it is a valid dataset for PyTorch.

To create a dataloader, it is as simple as:

loaderfeat_train = torch.utils.data.DataLoader(datasetfeat_train, batch_size=128, shuffle=True)
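
The sketch below checks what such a list-based dataloader yields. The arrays toy_feats and toy_labels are hypothetical stand-ins for conv_feat_train and labels_train, with made-up shapes.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Hypothetical stand-ins for conv_feat_train and labels_train.
toy_feats = np.random.rand(10, 2, 3).astype(np.float32)
toy_labels = np.arange(10) % 2

# One (flattened features, label) pair per sample, stored in a plain list.
toy_list = [(torch.from_numpy(f).reshape(-1), torch.tensor(l, dtype=torch.long))
            for f, l in zip(toy_feats, toy_labels)]

toy_loader = DataLoader(toy_list, batch_size=4, shuffle=False)
inputs, classes = next(iter(toy_loader))
print(inputs.shape, classes.shape, classes.dtype)
# torch.Size([4, 6]) torch.Size([4]) torch.int64
```

The default collation stacks the per-sample tensors along a new first dimension, which is why each batch comes out as one input tensor and one label tensor.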
Examples (3)

Today, we will make our own dataloader from scratch using a Python generator:

import numpy as np

def minibatch(batch_size, *tensors):
    # Yield consecutive slices of size batch_size (the last one may be smaller).
    if len(tensors) == 1:
        tensor = tensors[0]
        for i in range(0, len(tensor), batch_size):
            yield tensor[i:i + batch_size]
    else:
        for i in range(0, len(tensors[0]), batch_size):
            yield tuple(x[i:i + batch_size] for x in tensors)

def shuffle(*arrays):
    # Apply the same random permutation to every array, keeping rows aligned.
    random_state = np.random.RandomState()
    shuffle_indices = np.arange(len(arrays[0]))
    random_state.shuffle(shuffle_indices)
    if len(arrays) == 1:
        return arrays[0][shuffle_indices]
    else:
        return tuple(x[shuffle_indices] for x in arrays)

We will use it as follows (inside an epoch):

users, items, ratingss = shuffle(user_ids, item_ids, ratings)
user_ids_tensor = torch.from_numpy(users).to(device)
item_ids_tensor = torch.from_numpy(items).to(device)
ratings_tensor = torch.from_numpy(ratingss).to(device)
for (minibatch_num, (batch_user, batch_item, batch_rating)) in enumerate(
        minibatch(batch_size, user_ids_tensor, item_ids_tensor, ratings_tensor)):
    predictions = net(batch_user, batch_item)
    ...
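
For comparison, the same shuffled batches of (user, item, rating) triples can be obtained with the built-in classes instead of the handmade generator. This is a sketch using torch.utils.data.TensorDataset, which is not what the slide uses; the three small arrays are hypothetical toy data.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy interactions; real arrays would come from the ratings data.
user_ids = np.array([0, 0, 1, 2, 2], dtype=np.int64)
item_ids = np.array([1, 3, 0, 2, 3], dtype=np.int64)
ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0], dtype=np.float32)

interactions = TensorDataset(torch.from_numpy(user_ids),
                             torch.from_numpy(item_ids),
                             torch.from_numpy(ratings))
loader = DataLoader(interactions, batch_size=2, shuffle=True)

batch_sizes = []
triples = set()
for batch_user, batch_item, batch_rating in loader:
    batch_sizes.append(len(batch_user))
    # Record each (user, item, rating) triple to check that rows stay aligned.
    triples.update(zip(batch_user.tolist(), batch_item.tolist(), batch_rating.tolist()))

print(batch_sizes)  # [2, 2, 1]
```

One difference in design: the handmade version moves the full tensors to the device once and then slices them, whereas this loader assembles each batch sample by sample, so each batch would be moved to the device inside the loop.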
To learn more about dataloading

Have a look at the related PyTorch tutorial.

The end.
