Hence the Jacobian $J_f(x)\in\mathbb{R}^{m\times n}$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that for $x,v\in\mathbb{R}^n$ and $h\in\mathbb{R}$:
$$f(x+hv) = f(x) + hJ_f(x)v + o(h).$$
The term $J_f(x)v\in\mathbb{R}^m$ is a Jacobian Vector Product (JVP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x):\mathbb{R}^n\to\mathbb{R}^m$ given by $J_f(x)(v) = J_f(x)v$.
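As a quick sanity check of this definition, here is a minimal NumPy sketch (an illustration, not part of the original notes): for a linear map $f(x)=Ax$ the Jacobian is $A$ itself, so the JVP is just $Av$, and the first-order expansion above can be verified numerically. The matrix `A` and the helper `jvp_linear` are made up for this example.

```python
import numpy as np

# Toy linear map f(x) = A @ x, so J_f(x) = A and the JVP is A @ v.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
f = lambda x: A @ x

def jvp_linear(x, v):
    # JVP of f at x in direction v; for a linear map it does not depend on x.
    return A @ v

x, v = rng.standard_normal(4), rng.standard_normal(4)
h = 1e-6
# (f(x + h v) - f(x)) / h should match J_f(x) v up to the o(h) term
# (here f is linear, so the match is tight).
print(np.allclose((f(x + h * v) - f(x)) / h, jvp_linear(x, v)))
```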
In machine learning, we compute the gradient of the loss function with respect to the parameters. In particular, even though the parameters can be high-dimensional, the loss itself is a real number. Hence, consider a real-valued function $f:\mathbb{R}^n\xrightarrow{g_1}\mathbb{R}^m\xrightarrow{g_2}\mathbb{R}^d\xrightarrow{h}\mathbb{R}$, so that $f(x)=h(g_2(g_1(x)))\in\mathbb{R}$. By the chain rule, we have
$$\nabla f(x) = J_{g_1}(x)^T J_{g_2}(g_1(x))^T \nabla h(g_2(g_1(x))) \in \mathbb{R}^n.$$
To do this computation, if we start from the right, we first compute a matrix-vector product to obtain a vector (of size $m$) and then need another matrix-vector product, resulting in $O(nm+md)$ operations. If we start from the left with the matrix-matrix multiplication, we get $O(nmd+nd)$ operations. Hence we see that as soon as $m\approx d$, starting from the right is much more efficient. Note however that doing the computation from the right to the left requires keeping in memory the values of $g_1(x)\in\mathbb{R}^m$ and $x\in\mathbb{R}^n$.
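To make the operation counts concrete, here is a small NumPy sketch (again an illustration, with arbitrary dimensions) comparing the two evaluation orders: both give the same gradient, only the cost differs.

```python
import numpy as np

n, m, d = 500, 500, 500
rng = np.random.default_rng(0)
J_g1 = rng.standard_normal((m, n))   # Jacobian of g1 at x
J_g2 = rng.standard_normal((d, m))   # Jacobian of g2 at g1(x)
grad_h = rng.standard_normal(d)      # gradient of h at g2(g1(x))

# From the right: two matrix-vector products, O(md + nm) operations.
grad_right = J_g1.T @ (J_g2.T @ grad_h)

# From the left: a matrix-matrix product first, O(nmd + nd) operations.
grad_left = (J_g1.T @ J_g2.T) @ grad_h

print(np.allclose(grad_right, grad_left))   # same result, very different cost
```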
Backpropagation is an efficient algorithm computing the gradient "from the right to the left", i.e. backward. In particular, we will need to compute quantities of the form $J_f(x)^T u\in\mathbb{R}^n$ with $u\in\mathbb{R}^m$, which can be rewritten as $u^T J_f(x)$: this is a Vector Jacobian Product (VJP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x):\mathbb{R}^n\to\mathbb{R}^m$ composed with the linear map $u:\mathbb{R}^m\to\mathbb{R}$, so that $u^T J_f(x) = u\circ J_f(x)$.
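As a side note, PyTorch exposes this operation directly through `torch.autograd.functional.vjp`; the sketch below (with a made-up linear function) computes $J_f(x)^T u$ for $f(x)=Wx$, for which the VJP is simply $W^T u$.

```python
import torch

# Toy function f(x) = W @ x, so J_f(x) = W and the VJP is W.T @ u.
W = torch.randn(3, 4)
f = lambda x: W @ x

x = torch.randn(4)
u = torch.randn(3)

# torch.autograd.functional.vjp returns the pair (f(x), J_f(x)^T u).
_, vjp = torch.autograd.functional.vjp(f, x, u)
print(torch.allclose(vjp, W.T @ u))
```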
Example: let $f(x,W) = xW\in\mathbb{R}^b$ where $W\in\mathbb{R}^{a\times b}$ and $x\in\mathbb{R}^a$. We clearly have
$$J_f(x) = W^T.$$
Note that here, we are slightly abusing notation and considering the partial function $x\mapsto f(x,W)$. To see this, we can write $f_j = \sum_i x_i W_{ij}$ so that
$$\frac{\partial f}{\partial x_i} = (W_{i1},\dots,W_{ib})^T.$$
Then recall from the definitions that
$$J_f(x) = \left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_a}\right) = W^T.$$
Now we clearly have $J_f(W) = x$ since
$$f(x, W+\Delta W) = f(x,W) + x\,\Delta W.$$
Note that multiplying by $x$ on the left is actually convenient when using broadcasting, i.e. we can take a batch of input vectors of shape $\mathrm{bs}\times a$ without modifying the math above.
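To check these two formulas (and the batched version) numerically, here is a short PyTorch sketch, an illustration rather than part of the original notes, comparing the VJPs $u W^T$ and $x^T u$ with the gradients returned by autograd for a batch of shape $\mathrm{bs}\times a$.

```python
import torch

# Check the two VJPs of f(x, W) = x W against PyTorch's autograd,
# with a batch of inputs of shape (bs, a) as discussed above.
bs, a, b = 5, 3, 2
x = torch.randn(bs, a, requires_grad=True)
W = torch.randn(a, b, requires_grad=True)
u = torch.randn(bs, b)                   # upstream gradient, shape (bs, b)

y = x @ W
y.backward(u)                            # computes J_f(.)^T u for both arguments

print(torch.allclose(x.grad, u @ W.T))   # gradient w.r.t. the input:   u W^T
print(torch.allclose(W.grad, x.T @ u))   # gradient w.r.t. the weights: x^T u
```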
In PyTorch, torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Here is an example:
import torch
from torch.autograd import Function

class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)   # keep exp(i) for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result     # d exp(i)/di = exp(i)

# Use it by calling the apply method:
input = torch.randn(3, requires_grad=True)
output = Exp.apply(input)
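To check that `backward()` is consistent with `forward()`, one can use `torch.autograd.gradcheck`, which compares the analytical VJP with a finite-difference estimate (the test tensor below is made up; gradcheck expects double precision inputs):

```python
import torch
from torch.autograd import gradcheck

# Compare Exp's analytical backward with finite differences.
test_input = torch.randn(3, dtype=torch.double, requires_grad=True)
print(gradcheck(Exp.apply, (test_input,)))   # prints True if the VJP is correct
```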
Here we will implement in NumPy a different approach, mimicking the functional approach of JAX; see The Autodiff Cookbook.
Each function will take 2 arguments: the input $x$ and the parameters $w$. For each function, we build 2 vjp functions, taking as argument a gradient $u$ and corresponding to $J_f(x)$ and $J_f(w)$, so that these functions return $J_f(x)^T u$ and $J_f(w)^T u$ respectively. To summarize, for $x\in\mathbb{R}^n$, $w\in\mathbb{R}^d$, and $f(x,w)\in\mathbb{R}^m$,
$$\begin{aligned}
\mathrm{vjp}_x(u) &= J_f(x)^T u, \text{ with } J_f(x)\in\mathbb{R}^{m\times n},\ u\in\mathbb{R}^m,\\
\mathrm{vjp}_w(u) &= J_f(w)^T u, \text{ with } J_f(w)\in\mathbb{R}^{m\times d},\ u\in\mathbb{R}^m.
\end{aligned}$$
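Concretely, each building block can be written as a forward function that also returns its two vjp closures. Here is a minimal sketch of this convention for a linear layer and a squared-error loss; the names `linear_vjp` and `mse_vjp` are made up for illustration and are not the notebook's actual API.

```python
import numpy as np

def linear_vjp(x, w):
    # f(x, w) = x @ w, with x of shape (bs, a) and w of shape (a, b).
    out = x @ w
    vjp_x = lambda u: u @ w.T       # J_f(x)^T u, shape (bs, a)
    vjp_w = lambda u: x.T @ u       # J_f(w)^T u, shape (a, b)
    return out, vjp_x, vjp_w

def mse_vjp(x, target):
    # Scalar loss 0.5 * ||x - target||^2; its vjp is u * (x - target), u a scalar.
    out = 0.5 * np.sum((x - target) ** 2)
    vjp_x = lambda u: u * (x - target)
    return out, vjp_x
```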
Then backpropagation is simply done by first computing the gradient of the loss and then composing the vjp functions in the right order.
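Reusing the sketch above, backpropagation through a toy one-layer model then amounts to calling the vjp closures from the loss back to the input (again an illustration, not the notebook's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))         # batch of inputs, shape (bs, a)
w = rng.standard_normal((3, 2))         # parameters, shape (a, b)
target = rng.standard_normal((5, 2))

# Forward pass, keeping the vjp closures of each block.
h, vjp_x, vjp_w = linear_vjp(x, w)
loss, vjp_h = mse_vjp(h, target)

# Backward pass: start from d(loss)/d(loss) = 1.0 and compose the vjps.
u = vjp_h(1.0)            # gradient of the loss w.r.t. h, shape (bs, b)
grad_w = vjp_w(u)         # gradient w.r.t. the parameters, shape (a, b)
grad_x = vjp_x(u)         # gradient w.r.t. the input, shape (bs, a)

print(grad_w.shape, grad_x.shape)       # (3, 2) (5, 3)
```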