# Automatic Differentiation

Automatic/Algorithmic Differentiation (AD) is the application of the chain rule to functions by computers in order to automatically compute derivatives. From the definition of the derivative, we have:

$$
f^{\prime}(c)=\lim _{h \rightarrow 0} \frac{f(c+h)-f(c)}{h}
$$

Numerical differentiation is done with the finite difference rule, given as:

$$
f^{\prime}(x) \approx \frac{f(x+h)-f(x)}{h}
$$

The main problem with this formula is that it suffers from numerical instability for small values of $h$: the subtraction $f(x+h)-f(x)$ cancels most significant digits, and dividing by a tiny $h$ amplifies the remaining floating-point round-off error. For example, $\text{float}(\sqrt{2}) \times \text{float}(\sqrt{2})$ may evaluate to $\approx 2.000000446$ rather than exactly $2$.

As opposed to numerical differentiation, AD is a procedure for establishing **exact and accurate derivatives**, free of truncation error (only floating-point round-off remains). AD breaks a computer program into a series of fundamental mathematical operations represented as a directed acyclic graph (DAG). The gradient or Hessian of the program is then computed by successive application of the chain rule to its elementary constituents.

There are two variants of AD, based on the order in which the chain rule is applied:

- Forward mode
    - propagates derivatives from the independent variables (inputs) towards the dependent variables (outputs)
    - starts at the bottom of the DAG and moves towards the top (see the dual-number sketch at the end of this section)
- Reverse mode
    - propagates derivatives from the dependent variables (outputs) towards the independent variables (inputs)
    - starts at the top of the DAG and moves towards the bottom

Both forward and reverse mode yield the same results. However, the key differences are:

1. For functions with many inputs and few outputs (the typical machine-learning case), reverse mode requires fewer operations (time), at the cost of storing the intermediate values of the forward pass (memory).
2. The cost of forward mode grows with the number of inputs $N$.
3. The cost of reverse mode grows with the number of outputs $M$.

The following DAG represents $g(x)=2 x^2-x+1$. The constants are shown in gray and crossed out, as derivatives should not be propagated to constant operands.

![[DAG-auto-diff.png]]

In PyTorch, one can implement custom autograd functions as follows:

```python
import torch
from torch.autograd import Function

class Sigmoid(Function):
    @staticmethod
    def forward(ctx, x):
        output = 1 / (1 + torch.exp(-x))
        # Save the output; it is all the backward pass needs.
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        # d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))
        grad_x = output * (1 - output) * grad_output
        return grad_x
```

```python
import torch
from torch.autograd import Function

class ReLU(Function):
    @staticmethod
    def forward(ctx, input):
        # Save the input; the backward pass needs to know where it was negative.
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # The gradient is zero wherever the input was negative.
        grad_input[input < 0] = 0
        return grad_input
```
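These custom functions are invoked through their `.apply` method. As a quick check, here is a minimal sketch (assuming the `Sigmoid` class defined above is in scope) that compares the hand-written backward pass against numerical finite differences using `torch.autograd.gradcheck`:

```python
import torch

# Assumes the Sigmoid Function defined above is in scope.
# gradcheck works best with double precision, requires_grad=True inputs.
x = torch.randn(5, dtype=torch.double, requires_grad=True)

y = Sigmoid.apply(x)   # forward pass through the custom Function
y.sum().backward()     # reverse-mode AD fills in x.grad
print(x.grad)          # should equal sigmoid(x) * (1 - sigmoid(x))

# gradcheck compares the analytical backward() against numerical
# finite-difference gradients and returns True if they agree.
print(torch.autograd.gradcheck(Sigmoid.apply, (x,)))
```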
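`gradcheck` itself relies on finite differences, which brings back the numerical-instability point made earlier. The following small experiment (illustrative, not from the original note) shows the forward-difference error for $\frac{d}{dx}\sin(x)$ first shrinking as $h$ decreases, then growing again once round-off dominates:

```python
import math

# Forward-difference approximation of d/dx sin(x) at x = 1.0.
# Truncation error shrinks with h, but round-off error grows,
# so the approximation degrades once h becomes too small.
f, x = math.sin, 1.0
exact = math.cos(x)
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    approx = (f(x + h) - f(x)) / h
    print(f"h={h:.0e}  abs error={abs(approx - exact):.2e}")
```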
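Finally, the dual-number sketch referenced in the forward-mode bullet above: a minimal, illustrative implementation (the `Dual` class is introduced here for illustration only and is not part of PyTorch). Each value carries its derivative alongside it, and the chain rule is applied as the expression for $g(x)=2x^2-x+1$ is evaluated, i.e. derivatives flow from the input towards the output.

```python
# A minimal forward-mode AD sketch using dual numbers (illustrative only).
# Each Dual carries a primal value and the derivative of that value with
# respect to the seeded input; arithmetic applies the chain rule on the fly.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value = value   # primal value
        self.deriv = deriv   # derivative w.r.t. the seeded input

    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = self._wrap(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = self._wrap(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def g(x):
    return 2 * x * x - x + 1   # same g(x) = 2x^2 - x + 1 as the DAG above


x = Dual(3.0, 1.0)            # seed dx/dx = 1 at x = 3
gx = g(x)
print(gx.value, gx.deriv)     # 16.0 and g'(3) = 4*3 - 1 = 11.0
```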