Marc Lelarge
1- Course overview: machine learning pipeline
2- PyTorch tensors and automatic differentiation
3- Classification with deep learning
4- Convolutional neural networks
5- Embedding layers and dataloaders
torch.utils.data
torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit from Dataset and override the following methods:
__len__ so that len(dataset) returns the size of the dataset.
__getitem__ to support indexing, such that dataset[i] can be used to get the ith sample.
Dataloader
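By using a simple for loop to iterate over the data, we are missing out on several features: batching the data, shuffling the data, and loading the data in parallel using multiprocessing workers.
torch.utils.data.DataLoader is an iterable which provides all these features.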
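As a rough illustration of what DataLoader provides, the sketch below wraps a toy dataset and iterates over it. The data, the sizes and the use of TensorDataset are assumptions made for the example, not part of the course material; num_workers=0 keeps loading in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (assumption): 10 samples with 3 features each, and binary labels.
features = torch.arange(30, dtype=torch.float).reshape(10, 3)
labels = torch.arange(10) % 2
toy_dataset = TensorDataset(features, labels)

# batch_size groups samples into batches, shuffle reorders them at each epoch,
# and num_workers > 0 would load batches in parallel worker processes.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = [tuple(x.shape) for x, y in toy_loader]
print(batch_shapes)  # three batches, of sizes 4, 4 and 2
```

The last batch is smaller because 10 is not a multiple of 4; passing drop_last=True would discard it instead.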
In the first lesson, we created two datasets, one for training and one for validation:
    import os
    from torchvision import transforms, datasets

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    imagenet_format = transforms.Compose([transforms.CenterCrop(224), transforms.ToTensor(), normalize])
    dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), imagenet_format) for x in ['train', 'valid']}
Hence len(dsets['train']) returns 23000, i.e. the number of images in the training set, or more precisely the number of image files found under data_dir/train/.
Recall that data_dir/train/ (and similarly data_dir/valid/) is split into two folders, cats/ and dogs/. You can check that each of the two training folders has 11500 images with: ls | wc -l
You can recover the classes with dsets['train'].classes, which returns ['cats', 'dogs'].
These are features of the torchvision.datasets.ImageFolder class.
More importantly, what does dsets['train'][0] return?
Answer: a tuple containing a tensor and a label.
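To make this concrete without needing the image folders, here is a minimal sketch of a custom dataset that overrides __len__ and __getitem__ and returns the same kind of (tensor, label) pair. The class name ToyImageDataset and the random data are hypothetical; they only mimic the shape of what ImageFolder yields after imagenet_format.

```python
import torch
from torch.utils.data import Dataset


class ToyImageDataset(Dataset):
    """Hypothetical dataset mimicking ImageFolder: item i is (image tensor, class index)."""

    classes = ['cats', 'dogs']

    def __init__(self, n_samples=6):
        generator = torch.Generator().manual_seed(0)
        # Random 3x224x224 "images" standing in for real, normalized pictures.
        self.images = torch.randn(n_samples, 3, 224, 224, generator=generator)
        # Alternate the two class indices: 0, 1, 0, 1, ...
        self.targets = [i % 2 for i in range(n_samples)]

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, i):
        return self.images[i], self.targets[i]


toy = ToyImageDataset()
image, label = toy[0]
print(len(toy), image.shape, toy.classes[label])
```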
To obtain a dataloader for the training set:
train_loader = torch.utils.data.DataLoader(dsets['train'], batch_size=64, shuffle=True, num_workers=6)
Then, you can use it as follows:
    for input, label in train_loader:
        output = model(input)
        loss = loss_fn(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ....
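For a version that runs end to end, the sketch below fills in the pieces the loop assumes. The linear model, the cross-entropy loss_fn, the SGD optimizer and the random data are all placeholders chosen for the example, not the model used in the course.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data (assumption): 32 samples, 5 features, 2 classes.
inputs_all = torch.randn(32, 5)
labels_all = torch.randint(0, 2, (32,))
loader = DataLoader(TensorDataset(inputs_all, labels_all), batch_size=8, shuffle=True)

# Placeholder model, loss and optimizer (assumptions).
model = nn.Linear(5, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

n_steps = 0
for inputs, labels in loader:
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # compute gradients of the loss w.r.t. the parameters
    optimizer.step()       # update the parameters
    n_steps += 1

print(n_steps, loss.item())
```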
In the first lesson, we precomputed features and converted them to numpy arrays. You first need to load the features and the labels:
    conv_feat_train, labels_train = preconvfeat(loader_train)
and then, you can create a list for your dataset as follows:
    dtype = torch.float
    datasetfeat_train = [[torch.from_numpy(f).type(dtype), torch.tensor(l).type(torch.long)] for (f, l) in zip(conv_feat_train, labels_train)]
    datasetfeat_train = [(inputs.reshape(-1), classes) for [inputs, classes] in datasetfeat_train]
A list supports the built-in len() function and has a __getitem__() method, hence it is a valid dataset for PyTorch.
To create a dataloader, it is as simple as:
loaderfeat_train = torch.utils.data.DataLoader(datasetfeat_train, batch_size=128, shuffle=True)
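The following sketch checks this on made-up data: a plain list of (tensor, label) pairs is passed to DataLoader, whose default collation stacks the items of each batch. The feature size and the number of samples are arbitrary assumptions standing in for the precomputed features.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Made-up stand-ins for precomputed features and labels (assumption).
toy_feat = np.random.rand(10, 4).astype(np.float32)
toy_labels = np.arange(10) % 2

# A plain Python list of (features, label) pairs.
toy_list = [(torch.from_numpy(f), torch.tensor(l, dtype=torch.long))
            for f, l in zip(toy_feat, toy_labels)]

toy_loader = DataLoader(toy_list, batch_size=5, shuffle=True)
first_inputs, first_labels = next(iter(toy_loader))
print(first_inputs.shape, first_labels.shape)
```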
Today, we will make our own dataloader from scratch using a Python generator:
    import numpy as np

    def minibatch(batch_size, *tensors):
        # Yield successive slices of size batch_size (the last one may be smaller).
        if len(tensors) == 1:
            tensor = tensors[0]
            for i in range(0, len(tensor), batch_size):
                yield tensor[i:i + batch_size]
        else:
            for i in range(0, len(tensors[0]), batch_size):
                yield tuple(x[i:i + batch_size] for x in tensors)

    def shuffle(*arrays):
        # Apply the same random permutation to every array.
        random_state = np.random.RandomState()
        shuffle_indices = np.arange(len(arrays[0]))
        random_state.shuffle(shuffle_indices)
        if len(arrays) == 1:
            return arrays[0][shuffle_indices]
        else:
            return tuple(x[shuffle_indices] for x in arrays)
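A quick check of these two helpers on small numpy arrays is sketched below. The arrays ids and squares are made up for the example, and the functions are repeated so the snippet runs on its own.

```python
import numpy as np


def minibatch(batch_size, *tensors):
    if len(tensors) == 1:
        tensor = tensors[0]
        for i in range(0, len(tensor), batch_size):
            yield tensor[i:i + batch_size]
    else:
        for i in range(0, len(tensors[0]), batch_size):
            yield tuple(x[i:i + batch_size] for x in tensors)


def shuffle(*arrays):
    random_state = np.random.RandomState()
    shuffle_indices = np.arange(len(arrays[0]))
    random_state.shuffle(shuffle_indices)
    if len(arrays) == 1:
        return arrays[0][shuffle_indices]
    else:
        return tuple(x[shuffle_indices] for x in arrays)


# Made-up data: each id is paired with its square.
ids = np.arange(10)
squares = ids ** 2

shuffled_ids, shuffled_squares = shuffle(ids, squares)
# The same permutation is applied to both arrays, so pairs stay aligned.
aligned = bool(np.all(shuffled_squares == shuffled_ids ** 2))

batch_sizes = [len(b_ids) for b_ids, b_sq in minibatch(4, shuffled_ids, shuffled_squares)]
print(aligned, batch_sizes)  # True [4, 4, 2]
```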
We will use it as follows (inside an epoch):
    users, items, ratingss = shuffle(user_ids, item_ids, ratings)
    user_ids_tensor = torch.from_numpy(users).to(device)
    item_ids_tensor = torch.from_numpy(items).to(device)
    ratings_tensor = torch.from_numpy(ratingss).to(device)
    for (minibatch_num, (batch_user, batch_item, batch_rating)) in enumerate(minibatch(batch_size, user_ids_tensor, item_ids_tensor, ratings_tensor)):
        predictions = net(batch_user, batch_item)
        ...
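To see this loop run, one possible setup is sketched below. Everything outside the loop is assumed for illustration: the DotModel network (a dot product of user and item embeddings), the random ratings, the MSE loss and the Adam optimizer are not taken from the course. The two helpers are condensed copies of minibatch and shuffle, restricted to the case of several arrays.

```python
import numpy as np
import torch
from torch import nn


def minibatch(batch_size, *tensors):
    # Condensed copy of the helper (several tensors only).
    for i in range(0, len(tensors[0]), batch_size):
        yield tuple(x[i:i + batch_size] for x in tensors)


def shuffle(*arrays):
    # Condensed copy of the helper (several arrays only).
    indices = np.random.permutation(len(arrays[0]))
    return tuple(x[indices] for x in arrays)


class DotModel(nn.Module):
    """Hypothetical network: rating = <user embedding, item embedding>."""

    def __init__(self, n_users, n_items, dim=8):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, users, items):
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=1)


# Random interactions (assumption): 100 ratings over 20 users and 30 items.
rng = np.random.RandomState(0)
user_ids = rng.randint(0, 20, size=100).astype(np.int64)
item_ids = rng.randint(0, 30, size=100).astype(np.int64)
ratings = rng.uniform(1, 5, size=100).astype(np.float32)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = DotModel(20, 30).to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
batch_size = 32

epoch_losses = []
for epoch in range(3):
    # Reshuffle at the start of every epoch, then move the arrays to the device.
    users, items, ratingss = shuffle(user_ids, item_ids, ratings)
    user_ids_tensor = torch.from_numpy(users).to(device)
    item_ids_tensor = torch.from_numpy(items).to(device)
    ratings_tensor = torch.from_numpy(ratingss).to(device)
    epoch_loss = 0.0
    for minibatch_num, (batch_user, batch_item, batch_rating) in enumerate(
            minibatch(batch_size, user_ids_tensor, item_ids_tensor, ratings_tensor)):
        predictions = net(batch_user, batch_item)
        loss = loss_fn(predictions, batch_rating)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Average loss over the minibatches of this epoch.
    epoch_losses.append(epoch_loss / (minibatch_num + 1))

print(epoch_losses)
```

Shuffling the numpy arrays before converting them to tensors means each epoch sees the minibatches in a different order, which is the role shuffle=True plays in DataLoader.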
The end.