
PyTorch: Automatic Differentiation


Grad

Start with a simple tensor:

import torch

x = torch.arange(4.0)
x # tensor([0., 1., 2., 3.])

In order to avoid allocating new memory every time we take a derivative, we need to declare requires_grad=True when creating the tensor, or set it manually afterwards.

x.requires_grad_(True)
# same as
x = torch.arange(4.0, requires_grad=True)
x.grad # None by default

Now, when we use x in a forward pass, PyTorch will build a computation graph. For example:

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y + 3

# builds up the graph:
#   x -> (square) -> y -> (+3) -> z

Since gradients are enabled, z will carry a backward function (grad_fn):

tensor(7., grad_fn=<AddBackward0>)

and when we run z.backward(), PyTorch will:

  1. Start at z
  2. Apply the chain rule backward
  3. Store the gradients in .grad
z.backward()
x.grad # tensor(4.), since dz/dx = 2x = 4
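
One thing to watch out for (a small sketch with a fresh scalar x): .grad accumulates across backward calls, so it is common to zero it between unrelated computations.

x = torch.tensor(2.0, requires_grad=True)
(x ** 2).backward()
x.grad           # tensor(4.)
(x ** 2).backward()
x.grad           # tensor(8.) -- the second backward added to the existing gradient
x.grad.zero_()   # reset the buffer before the next, unrelated computation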

Backward for non-scalar variables

However, we can only call backward() without arguments on a scalar. For example, we cannot call y.backward() directly when y = x * x and x is a vector.
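
Concretely, calling backward() on a non-scalar output without a gradient argument raises an error; a minimal sketch:

x = torch.arange(4.0, requires_grad=True)
y = x * x          # y is a vector, not a scalar
try:
    y.backward()   # no gradient argument supplied
except RuntimeError as e:
    print(e)       # grad can be implicitly created only for scalar outputs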

When y is a vector, the most natural representation of the derivative of y with respect to the vector x is the Jacobian matrix, which contains the partial derivatives of each component of y with respect to each component of x.

For example, suppose we have:

f(u, v) = \begin{pmatrix} x(u, v) \\ y(u, v) \end{pmatrix}

Then the Jacobian matrix is:

J = \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix}
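
For a concrete picture, the full Jacobian of a small function can be materialized with torch.autograd.functional.jacobian; a minimal sketch (the element-wise square below is just an illustration):

import torch
from torch.autograd.functional import jacobian

def f(x):
    return x * x   # element-wise square: y_i = x_i ** 2

jacobian(f, torch.arange(4.0))
# tensor([[0., 0., 0., 0.],
#         [0., 2., 0., 0.],
#         [0., 0., 4., 0.],
#         [0., 0., 0., 6.]])   # diagonal matrix with 2 * x_i on the diagonal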

For higher-order x and y, the result of differentiation would be an even higher-order tensor. PyTorch cannot decide for us how to combine the outputs (for example, via a sum or a mean), and returning the full Jacobian is almost always infeasible, unnecessary, and incompatible with how training actually works.

However, we can provide a vector $v$ to backward, so that it computes the vector-Jacobian product $v^\top \partial_x y$. This also allows PyTorch to calculate the result directly, without forming the entire Jacobian. For example,

x = torch.arange(4.0, requires_grad=True)  # the vector x again
y = x * x
y.backward(gradient=torch.ones(len(y)))    # v = [1, 1, 1, 1], so this computes v^T J
x.grad  # tensor([0., 2., 4., 6.])

# this is the same as: y.sum().backward()
# (if x.grad already held values, we would call x.grad.zero_() first, since gradients accumulate)
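
To see the vector-Jacobian product more directly, a different choice of v picks out a single row of the Jacobian (a small sketch, continuing with the same x; the particular v is just for illustration):

x.grad.zero_()   # clear the gradient from the previous backward
y = x * x
y.backward(gradient=torch.tensor([0., 0., 1., 0.]))  # v = e_2 selects y[2] = x[2] ** 2
x.grad  # tensor([0., 0., 4., 0.]) -- only d(y[2])/dx[2] = 2 * x[2] = 4 is non-zero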

Detaching

Sometimes, we may not want gradients to flow through all paths, for example when:

  • One part is a fixed target
  • One part comes from an old network
  • One value is used as a constant weight

In these cases, we can detach a variable so that it is treated as a constant.

For example,

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = x * x        # y = x²
z = w * w        # z = w²
u = y * z        # u = x² w²

u.backward()

When we read x.grad, it is actually $\frac{\partial u}{\partial x}$, which gives us $2w^2x = 36$; similarly for w.
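
Checking the numbers from the example above:

x.grad  # tensor(36.) = 2 * w**2 * x = 2 * 9 * 2
w.grad  # tensor(24.) = 2 * x**2 * w = 2 * 4 * 3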

We can treat y as a constant by calling y.detach().

x.grad.zero_()       # reset the gradients accumulated above
w.grad.zero_()
u = y.detach() * z   # y.detach() is treated as a constant
u.backward()

x.grad # tensor(0.) -- no gradient flows back through the detached y
w.grad # tensor(24.) = 2 * w * y

So we can reuse a value numerically, like y, while stopping gradients from flowing through its computation path.
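
As a quick check (continuing the example), the detached tensor has the same value as y but carries no gradient history:

y_const = y.detach()
y_const                 # tensor(4.) -- same value as y
y_const.requires_grad   # False: detached from the graph
y.requires_grad         # True: the original y still tracks gradients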

Reference

This section is a study note based on the book Dive into Deep Learning.