class: center, middle, title-slide
count: false

# Module 4:
## Optimization for deep learning

.bold[Marc Lelarge]

---
# (1) Optimization and deep learning

## Gradient descent, stochastic gradient descent and mini-batch SGD

# (2) Gradient descent optimization algorithms

## Momentum, Nesterov accelerated gradient, Adagrad, RMSProp, Adam, AMSGrad

# (3) PyTorch optimizers

---
# (1) Some warnings about optimization in deep learning

The objective function of an optimization algorithm is usually a loss function based on the training data set, hence the goal of optimization is to reduce the _training error_.

--
count: false

However, the goal of (deep) learning is to reduce the _generalization error_.

--
count: false

In order to reduce the generalization error, we need to pay attention to _overfitting_ in addition to using the optimization algorithm to reduce the training error.

--
count: false

In this course, we focus specifically on the _performance_ of the optimization algorithm in minimizing the objective function, rather than on the model's generalization error. In the next lessons, we will see techniques to avoid _overfitting_.

--
count: false

No theorems in this lecture (for theorems, see a convex optimization course). We focus on intuitions, because deep learning optimization problems are not convex and we hope that the _intuition_ built on convex problems will still be useful!

---
# (1) What do we optimize?

Recall simple linear regression, where the goal is to minimize the .bold[cost function]:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-\theta^T x(i))^2.
$$

--
count: false

For a deep neural network $F(.;\theta)$ with parameters $\theta$, we minimize the objective function:
$$
J(\theta) = \frac{1}{2}\sum\_{i=1}^m(y(i)-F(x(i);\theta))^2,
$$
over the parameter $\theta$ on the _training set_. This problem is not convex anymore, but we can still compute the gradient $\nabla J(\theta)$ of the cost function with respect to the parameters (well, PyTorch does it for us!).

---
# (1) Gradient descent variants

Idea: update the parameters in the opposite direction of the gradient. The _learning rate_ $\eta$ determines the size of the steps.

--
count: false

## Batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t)
$$

--
count: false

## Stochastic gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i),y(i))
$$

--
count: false

## Mini-batch gradient descent
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t;x(i:i+b),y(i:i+b)),
$$
where $b$ is the batch size.

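---
count: false
# (1) Gradient descent variants

A minimal sketch of one mini-batch SGD step on the least-squares objective above, written with plain PyTorch tensors; `X`, `y`, `theta` and the hyper-parameter values below are only placeholders for illustration:

```py
import torch

X = torch.randn(100, 5)                    # toy inputs: m=100 samples, 5 features
y = torch.randn(100)                       # toy targets
theta = torch.zeros(5, requires_grad=True)
eta, b = 0.1, 16                           # learning rate and batch size

idx = torch.randint(0, X.size(0), (b,))                # pick a mini-batch
loss = 0.5 * ((y[idx] - X[idx] @ theta) ** 2).sum()    # J on the mini-batch
loss.backward()                                        # gradient w.r.t. theta
with torch.no_grad():
    theta -= eta * theta.grad                          # mini-batch SGD update
    theta.grad.zero_()
```

Setting $b=1$ gives stochastic gradient descent, and $b=m$ (the full training set) gives batch gradient descent.
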
---
# (1) Challenges

Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually also employed when mini-batches are used!

--
count: false

- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?

- Why apply the same learning rate to all parameter updates?

- How to escape saddle points, where the gradient is close to zero in all dimensions?

---
count: false
# (1) Challenges

Mini-batch gradient descent is the algorithm of choice when training a neural network. The term SGD is usually also employed when mini-batches are used!

- Choosing a proper learning rate can be difficult. How to adapt the learning rate during training?

- Why apply the same learning rate to all parameter updates?

- How to escape saddle points, where the gradient is close to zero in all dimensions?

In the rest of the lecture, we will introduce _modifications to SGD_. Although we will apply these modifications to mini-batch gradient descent, we omit the explicit reference to the batch $x(i:i+b),y(i:i+b)$. Hence we write for SGD:
$$
\theta\_{t+1} = \theta\_t - \eta \nabla J(\theta\_t),
$$
but we need to keep in mind that updates are performed for every mini-batch of $b$ training samples.

---
# (2) Gradient descent optimization algorithms

A nice [survey](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder.

Please run the [jupyter notebook](https://colab.research.google.com/github/dataflowr/notebooks/blob/master/Module4/04_gradient_descent_optimization_algorithms_empty.ipynb) in parallel, as you will need to code the optimization algorithms as soon as we see them.

We will deal with the toy problem: minimize in $x\_1,x\_2$ the function
$$
f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2.
$$

--
count: false

You will need to implement all variations of SGD. Have a look at the `train_2d()` function and at the gradient descent code to understand what this function does.

---
# (2) Momentum

Accelerating SGD by dampening oscillations, i.e. by averaging the last values of the gradient.

--
count: false

$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t)\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$

--
count: false

Why does it work? With $g\_t = \nabla J(\theta\_t)$, we have for any $k\geq 0$:
$$
v\_{t+1} = \gamma^{k+1} v\_{t-k} +\eta \underbrace{\sum\_{i=0}^k \gamma^i g\_{t-i}}\_{\text{average of last gradients}}
$$

A typical value is $\gamma = 0.9$.

---
# (2) Nesterov accelerated gradient

With momentum, we first compute the gradient at the current position and then take a step that combines the momentum and this gradient. Nesterov proposed to first take the step given by the momentum and only then to correct it with the gradient computed at that new position:
$$\begin{aligned}
v\_{t+1} &= \gamma v\_{t}+ \eta \nabla J(\theta\_t-\gamma v\_{t})\\\\
\theta\_{t+1} &= \theta\_t - v\_{t+1}
\end{aligned}$$

Source for the image: [G. Hinton's lecture 6c](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)

.citation[Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR, 1983]

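---
count: false
# (2) Nesterov accelerated gradient

A minimal sketch of the momentum and Nesterov updates applied to the toy problem $f(x\_1, x\_2) = 0.1 x\_1^2+2 x\_2^2$ seen earlier; the helper names and step sizes below are illustrative and not the ones used in the notebook:

```py
def grad_f(x1, x2):
    # gradient of f(x1, x2) = 0.1*x1**2 + 2*x2**2
    return 0.2 * x1, 4 * x2

def momentum_step(x1, x2, v1, v2, eta=0.4, gamma=0.9):
    g1, g2 = grad_f(x1, x2)                                  # gradient at theta_t
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2    # v_{t+1}
    return x1 - v1, x2 - v2, v1, v2                          # theta_{t+1}

def nesterov_step(x1, x2, v1, v2, eta=0.4, gamma=0.9):
    g1, g2 = grad_f(x1 - gamma * v1, x2 - gamma * v2)        # look-ahead gradient
    v1, v2 = gamma * v1 + eta * g1, gamma * v2 + eta * g2
    return x1 - v1, x2 - v2, v1, v2

x1, x2, v1, v2 = -5.0, -2.0, 0.0, 0.0
for _ in range(50):
    x1, x2, v1, v2 = momentum_step(x1, x2, v1, v2)
print(x1, x2)  # both coordinates move toward the minimum at (0, 0)
```
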
---
# (2) Adagrad

We would like to adapt our updates to each individual parameter, i.e. to have a different, decreasing learning rate for each parameter.

$$\begin{aligned}
s\_{t+1,i} &= s\_{t,i} + \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$

--
count: false

No manual tuning of the learning rate. Typical default values: $\eta=0.01$ and $\epsilon = 10^{-8}$.

.citation[Duchi et al., [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://jmlr.org/papers/v12/duchi11a.html), JMLR 2011]

---
# (2) RMSProp

Problem with Adagrad: the learning rate goes to zero and the algorithm never forgets about the past.

--
count: false

Idea proposed by G. Hinton in his Coursera class: use an exponential moving average.

$$\begin{aligned}
s\_{t+1,i} &= \gamma s\_{t,i} + (1-\gamma) \nabla J(\theta\_t)\_i^2\\\\
\theta\_{t+1,i} &= \theta\_{t,i} - \frac{\eta}{\sqrt{s\_{t+1,i}+\epsilon}}\nabla J(\theta\_t)\_i
\end{aligned}$$

--
count: false

With a slight abuse of notation, we rewrite the update as follows:

$$\begin{aligned}
s\_{t+1} &= \gamma s\_{t} + (1-\gamma) \nabla J(\theta\_t)^2\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{s\_{t+1}+\epsilon}}\nabla J(\theta\_t)
\end{aligned}$$

Typical values: $\gamma = 0.9$ and $\eta = 0.001$.

.citation[Hinton, [Coursera lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)]

---
# (2) Adam

Mixing ideas from RMSProp and momentum, we get Adam = Adaptive Moment Estimation.

$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{m}\_{t+1} &= \frac{m\_{t+1}}{1-\beta\_1^{t+1}}\\\\
\hat{v}\_{t+1} &= \frac{v\_{t+1}}{1-\beta\_2^{t+1}}\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} \hat{m}\_{t+1}
\end{aligned}$$

$\hat{m}\_t$ and $\hat{v}\_t$ are estimates of the first and second moments of the gradients. Because $m\_0=v\_0=0$, these estimates are biased towards $0$; the factors $(1-\beta\_1^{t+1})^{-1}$ and $(1-\beta\_2^{t+1})^{-1}$ are here to counteract these biases.

--
count: false

Typical values: $\beta\_1=0.9$, $\beta\_2 =0.999$ and $\epsilon=10^{-8}$.

.citation[Kingma et al., [Adam: a Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980), ICLR 2015]

---
# (2) AMSGrad

Sometimes Adam forgets too fast; to fix this, we replace the moving average by a running $\max$:

$$\begin{aligned}
m\_{t+1} &= \beta\_1 m\_t + (1-\beta\_1) \nabla J(\theta\_t)\\\\
v\_{t+1} &= \beta\_2 v\_t + (1-\beta\_2) \nabla J(\theta\_t)^2\\\\
\hat{v}\_{t+1} &= \max\left(\hat{v}\_{t}, v\_{t+1}\right)\\\\
\theta\_{t+1} &= \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}\_{t+1}}+\epsilon} m\_{t+1}
\end{aligned}$$

.citation[Reddi et al., [On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ), ICLR 2018]

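---
count: false
# (2) AMSGrad

A minimal from-scratch sketch of the Adam update seen above, with the AMSGrad $\max$ as an option; the helper name `adam_step` and the hyper-parameters used in the demo loop are illustrative only (in practice you would simply use PyTorch's built-in `torch.optim.Adam(..., amsgrad=True)`):

```py
import torch

def adam_step(theta, grad, m, v, t, v_hat_max=None,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    # one Adam update of the parameter theta given the current gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** (t + 1))      # bias correction
    v_hat = v / (1 - beta2 ** (t + 1))
    if amsgrad:                             # AMSGrad: running max instead of v_hat
        v_hat_max = v_hat if v_hat_max is None else torch.maximum(v_hat_max, v_hat)
        v_hat = v_hat_max
    theta = theta - eta * m_hat / (v_hat.sqrt() + eps)
    return theta, m, v, v_hat_max

# demo on the toy problem f(x1, x2) = 0.1*x1**2 + 2*x2**2
theta = torch.tensor([-5.0, -2.0])
m = torch.zeros(2)
v = torch.zeros(2)
for t in range(100):
    grad = torch.tensor([0.2, 4.0]) * theta              # gradient of f at theta
    theta, m, v, _ = adam_step(theta, grad, m, v, t, eta=0.1)  # larger eta for this toy demo
print(theta)  # both coordinates move toward the minimum at (0, 0)
```
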
---
# (3) [PyTorch optimizers](https://pytorch.org/docs/stable/optim.html)

All have a similar constructor `torch.optim.*(params, lr=..., momentum=...)`. Default values differ between optimizers, check the doc.

`params` should be an iterable (like a list) containing the parameters to optimize over. It can be obtained from any module with `module.parameters()`.

The `step` method updates the internal state of the optimizer according to the `grad` attributes of the `params`, and then updates the latter according to this internal state.

```py
import torch
import torch.nn as nn

criterion = nn.NLLLoss()
# model_vgg and device are assumed to be defined earlier in the notebook
optimizer_vgg = torch.optim.SGD(model_vgg.classifier[6].parameters(), lr=0.001)

def train_model(model, dataloader, size, epochs=1, optimizer=None):
    model.train()
    for epoch in range(epochs):
        for inputs, classes in dataloader:
            inputs = inputs.to(device)
            classes = classes.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, classes)
            # reset the gradients, backpropagate, then update the parameters
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

---
class: end-slide, center
count: false

The end.