Grad
Start with a simple example:
x = torch.arange(4.0)
x # tensor([0., 1., 2., 3.])
In order to avoid allocating new memory every time we take a derivative, we need to declare requires_grad=True when creating the tensor,
or set it manually afterwards.
x.requires_grad_(True)
# same as
x = torch.arange(4.0, requires_grad=True)
x.grad # None by default
Now, when we use x in a forward pass, PyTorch will build a computation graph, for example:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y + 3
# build up a graph
# x -> (square) -> y -> (+3) -> z
Since we enabled grad, z will carry a backward function:
tensor(7., grad_fn=<AddBackward0>)
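We can inspect the graph through each tensor's grad_fn attribute (just a quick check; the exact object repr may vary across PyTorch versions):
x.grad_fn # None, since x is a leaf tensor created by the user
y.grad_fn # <PowBackward0 object at ...>, from x ** 2
z.grad_fn # <AddBackward0 object at ...>, from y + 3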
When we run z.backward(), PyTorch will:
- start at z
- apply the chain rule backward through the graph
- store the gradients in .grad
z.backward()
x.grad # tensor(4.), since dz/dx = 2x = 4 at x = 2
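One thing to keep in mind (and the reason for the grad-buffer cleanup later): backward() accumulates into .grad instead of overwriting it. A minimal sketch:
x = torch.tensor(2.0, requires_grad=True)
(x ** 2).backward()
x.grad # tensor(4.)
(x ** 2).backward() # a second forward/backward pass
x.grad # tensor(8.) -> the new gradient was added to the old one
x.grad.zero_() # reset before computing an unrelated gradient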
Backward for non-scalar variables
However, we can only call backward() on a scalar; for example, we cannot call y.backward() where y = x * x and x is a vector.
When y is a vector, the most natural representation of the derivative of y with respect to the vector x is the Jacobian matrix,
which contains the partial derivatives of each component of y with respect to each component of x.
For example, suppose we have $\mathbf{y} = f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$.
Then the Jacobian matrix is:
$$J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
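For a small concrete case, we can materialize this Jacobian with torch.autograd.functional.jacobian; this is a sketch for inspection only (the input tensor here is just an illustrative example), not something we would do during training:
from torch.autograd.functional import jacobian

inp = torch.arange(4.0)
jacobian(lambda t: t * t, inp)
# tensor([[0., 0., 0., 0.],
#         [0., 2., 0., 0.],
#         [0., 0., 4., 0.],
#         [0., 0., 0., 6.]])
# diagonal, since each output t_i ** 2 depends only on t_i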
The result of differentiation could be an even higher-order tensor when x and y are themselves higher-order tensors.
PyTorch will not pick one particular combination of the outputs (for example the sum or the mean) on its own,
and returning the full Jacobian is almost always infeasible, unnecessary, and incompatible with how training actually works.
However, we can provide some vector $\mathbf{v}$ to backward, so it will compute the vector–Jacobian product $\mathbf{v}^\top J$.
This also allows PyTorch to calculate the result directly, without forming the entire Jacobian.
for example,
x = torch.arange(4.0, requires_grad=True) # x was reassigned to a scalar above, so recreate the vector
# (when reusing an existing x, clean up the grad buffer first with x.grad.zero_())
y = x * x
y.backward(gradient=torch.ones(len(y)))
x.grad # tensor([0., 2., 4., 6.])
# this is the same as: y.sum().backward()
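The gradient argument is exactly the vector $\mathbf{v}$ above; passing something other than all ones computes a weighted combination of the rows of the Jacobian. Continuing with the same x (the particular v is arbitrary, just for illustration):
x.grad.zero_()
y = x * x # rebuild y, since the previous graph was freed by backward()
y.backward(gradient=torch.tensor([1.0, 0.0, 0.0, 2.0]))
x.grad # tensor([ 0., 0., 0., 12.]) -> element-wise v * 2x, i.e. v^T J for this diagonal J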
Detaching
Sometimes, we may not want gradients to flow through all paths, maybe because:
- One part is a fixed target
- One part comes from an old network
- One value is used as a constant weight
In these cases, we can detach a variable so that it is treated as a constant.
for example,
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = x * x # y = x²
z = w * w # z = w²
u = y * z # u = x² w²
u.backward()
When we call x.grad, it is actually $\frac{\partial u}{\partial x}$, which gives us 36,
and similarly for w, as worked out below.
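Writing the chain rule out explicitly (with x = 2 and w = 3):
$$\frac{\partial u}{\partial x} = \frac{\partial u}{\partial y}\,\frac{\partial y}{\partial x} = z \cdot 2x = w^2 \cdot 2x = 9 \cdot 4 = 36,
\qquad
\frac{\partial u}{\partial w} = \frac{\partial u}{\partial z}\,\frac{\partial z}{\partial w} = y \cdot 2w = x^2 \cdot 2w = 4 \cdot 6 = 24.$$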
We can instead treat y as a constant by calling y.detach().
x.grad.zero_(); w.grad.zero_() # clear the gradients from the previous backward pass
z = w * w # rebuild z, since the first backward() freed the old graph
u = y.detach() * z # reuse y's value, but cut it out of the graph
u.backward()
x.grad # tensor(0.) -> no gradient flows back through y to x
w.grad # tensor(24.) = 2w * y = 2 * 3 * 4
So we can reuse a value numerically, like y, but stop gradient flow through its computation path.
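A typical case is the "fixed target" item from the list above: stopping gradients from flowing through a target value inside a loss. A minimal sketch (the names and the toy loss here are made up for illustration):
prediction = torch.tensor(1.5, requires_grad=True)
target = (prediction * 0.9).detach() # pretend this comes from an old / frozen network
loss = (prediction - target) ** 2
loss.backward()
prediction.grad # tensor(0.3000) = 2 * (prediction - target); nothing flows through the target path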
Reference
This section is a study note based on the book Dive into Deep Learning.