Grad
Start with a simple example:
x = torch.arange(4.0)
x # tensor([0., 1., 2., 3.])
In order to avoid allocating new memory every time we take a derivative, we need to declare requires_grad=True when creating the tensor,
or set it manually afterwards.
x.requires_grad_(True)
# same as
x = torch.arange(4.0, requires_grad=True)
x.grad # None by default
Now, when we use x in a forward pass, PyTorch will build a computation graph, for example:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y + 3
# build up a graph
# x -> (square) -> y -> (+3) -> z
Since we enabled grad, z will carry a backward function:
tensor(7., grad_fn=<AddBackward0>)
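We can inspect the graph through each tensor's grad_fn attribute (just a quick check; the exact object repr may vary across PyTorch versions):
x.grad_fn # None, since x is a leaf tensor created by the user
y.grad_fn # <PowBackward0 object at ...>, from x ** 2
z.grad_fn # <AddBackward0 object at ...>, from y + 3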
When we run z.backward(), PyTorch will:
- start at z
- apply the chain rule backward through the graph
- store the gradients in .grad
z.backward()
x.grad # tensor(4.), since dz/dx = 2x = 4 at x = 2
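One thing to keep in mind (and the reason for the grad-buffer cleanup later): backward() accumulates into .grad instead of overwriting it. A minimal sketch:
x = torch.tensor(2.0, requires_grad=True)
(x ** 2).backward()
x.grad # tensor(4.)
(x ** 2).backward() # a second forward/backward pass
x.grad # tensor(8.) -> the new gradient was added to the old one
x.grad.zero_() # reset before computing an unrelated gradient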
Backward for non-scalar variables
However, we can only call backward() on a scalar; for example, we cannot call y.backward() where y = x * x and x is a vector.
When y is a vector, the most natural representation of the derivative of y with respect to the vector x is the Jacobian matrix,
which contains the partial derivatives of each component of y with respect to each component of x.
For example, suppose we have $\mathbf{y} = f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$.
Then the Jacobian matrix is:
$$J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
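For a small concrete case, we can materialize this Jacobian with torch.autograd.functional.jacobian; this is a sketch for inspection only (the input tensor here is just an illustrative example), not something we would do during training:
from torch.autograd.functional import jacobian

inp = torch.arange(4.0)
jacobian(lambda t: t * t, inp)
# tensor([[0., 0., 0., 0.],
#         [0., 2., 0., 0.],
#         [0., 0., 4., 0.],
#         [0., 0., 0., 6.]])
# diagonal, since each output t_i ** 2 depends only on t_i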
The result of differentiation could be an even higher-order tensor when x and y are themselves higher-order tensors.
PyTorch will not pick one particular combination of the outputs (for example the sum or the mean) on its own,
and returning the full Jacobian is almost always infeasible, unnecessary, and incompatible with how training actually works.
However, we can provide some vector $\mathbf{v}$ to backward, so it will compute the vector–Jacobian product $\mathbf{v}^\top J$.
This also allows PyTorch to calculate the result directly, without forming the entire Jacobian.
for example,
x = torch.arange(4.0, requires_grad=True) # x was reassigned to a scalar above, so recreate the vector
# (when reusing an existing x, clean up the grad buffer first with x.grad.zero_())
y = x * x
y.backward(gradient=torch.ones(len(y)))
x.grad # tensor([0., 2., 4., 6.])
# this is the same as: y.sum().backward()
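The gradient argument is exactly the vector $\mathbf{v}$ above; passing something other than all ones computes a weighted combination of the rows of the Jacobian. Continuing with the same x (the particular v is arbitrary, just for illustration):
x.grad.zero_()
y = x * x # rebuild y, since the previous graph was freed by backward()
y.backward(gradient=torch.tensor([1.0, 0.0, 0.0, 2.0]))
x.grad # tensor([ 0., 0., 0., 12.]) -> element-wise v * 2x, i.e. v^T J for this diagonal J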
Detaching
Sometimes, we may not want gradients to flow through all paths, maybe because:
- One part is a fixed target
- One part comes from an old network
- One value is used as a constant weight
In these cases, we can detach a variable so that it is treated as a constant.
for example,
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = x * x # y = x²
z = w * w # z = w²
u = y * z # u = x² w²
u.backward()
When we call x.grad, it is actually $\frac{\partial u}{\partial x}$, which gives us 36,
and similarly for w, as worked out below.
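Writing the chain rule out explicitly (with x = 2 and w = 3):
$$\frac{\partial u}{\partial x} = \frac{\partial u}{\partial y}\,\frac{\partial y}{\partial x} = z \cdot 2x = w^2 \cdot 2x = 9 \cdot 4 = 36,
\qquad
\frac{\partial u}{\partial w} = \frac{\partial u}{\partial z}\,\frac{\partial z}{\partial w} = y \cdot 2w = x^2 \cdot 2w = 4 \cdot 6 = 24.$$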
We can instead treat y as a constant by calling y.detach().
x.grad.zero_(); w.grad.zero_() # clear the gradients from the previous backward pass
z = w * w # rebuild z, since the first backward() freed the old graph
u = y.detach() * z # reuse y's value, but cut it out of the graph
u.backward()
x.grad # tensor(0.) -> no gradient flows back through y to x
w.grad # tensor(24.) = 2w * y = 2 * 3 * 4
So we can reuse a value numerically, like y, but stop gradient flow through its computation path.
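A typical case is the "fixed target" item from the list above: stopping gradients from flowing through a target value inside a loss. A minimal sketch (the names and the toy loss here are made up for illustration):
prediction = torch.tensor(1.5, requires_grad=True)
target = (prediction * 0.9).detach() # pretend this comes from an old / frozen network
loss = (prediction - target) ** 2
loss.backward()
prediction.grad # tensor(0.3000) = 2 * (prediction - target); nothing flows through the target path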
Reference
This section is a study note based on the book Dive into Deep Learning.