# Automatic Differentiation
Automatic/Algorithmic Differentiation (AD) is the application of the chain rule to functions by computers in order to automatically compute derivatives.
From the definition of the derivative, we have:
$$
f^{\prime}(c)=\lim _{h \rightarrow 0} \frac{f(c+h)-f(c)}{h}
$$
Numerical differentiation approximates this limit with the finite-difference rule:
$$
f^{\prime}(x) \approx \frac{f(x+h)-f(x)}{h}
$$
for some small step size $h > 0$.
The main problem with this formula is that it suffers from numerical instability for small values of $h$: subtracting two nearly equal numbers loses precision (catastrophic cancellation), and dividing the result by a tiny $h$ amplifies the round-off error. Floating-point arithmetic is inherently inexact; for example, $\text{float}(\sqrt{2}) \cdot \text{float}(\sqrt{2})$ may evaluate to $\approx 2.000000446$ rather than exactly $2$.
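A minimal sketch of this trade-off (assuming NumPy is available; the test function and step sizes are illustrative choices, not from the original): as $h$ shrinks, the truncation error of the finite-difference rule falls, but round-off error grows, so the total error eventually gets worse again.
```python
import numpy as np

def f(x):
    return np.sin(x)          # illustrative test function; f'(x) = cos(x)

x = 1.0
true_derivative = np.cos(x)

# Sweep h over several orders of magnitude and watch the error behave non-monotonically.
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    approx = (f(x + h) - f(x)) / h
    print(f"h={h:.0e}  error={abs(approx - true_derivative):.3e}")
```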
As opposed to numerical differentiation, AD is a procedure for computing **exact derivatives** (up to floating-point precision) without any truncation error. AD breaks a computer program into a sequence of fundamental mathematical operations represented as a directed acyclic graph (DAG). The gradient or Hessian of the program is then computed by successively applying the chain rule to its elementary constituents.
There are two variants of AD, distinguished by the order in which the chain rule is applied:
- Forward mode
	- propagates derivatives from the independent (input) variables towards the dependent (output) variables
	- starts at the bottom of the DAG and moves towards the top
- Reverse mode
	- propagates derivatives from the dependent (output) variables towards the independent (input) variables
	- starts at the top of the DAG and moves towards the bottom
Both forward and reverse mode yield the same results. The key differences (illustrated in the sketch below) are:
1. For the common case of many inputs and a single scalar output (e.g. a loss), reverse mode requires fewer operations (time), but it must store the intermediate values of the forward pass for reuse in the backward pass (memory).
2. The cost of forward mode grows with the number of inputs $N$.
3. The cost of reverse mode grows with the number of outputs $M$.
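A minimal sketch of the two modes in PyTorch, assuming PyTorch ≥ 2.0 for `torch.func`; the function `f`, the input `x`, and the probe vectors `v` and `u` are illustrative choices:
```python
import torch
from torch.func import jvp, vjp  # assumes PyTorch >= 2.0

def f(x):
    # Illustrative function from R^3 to R^2
    return torch.stack([x[0] * x[1], x[1].sin() + x[2]])

x = torch.randn(3)

# Forward mode: push a tangent vector v through the graph.
# One pass yields J(x) @ v, i.e. one column of the Jacobian per input direction.
v = torch.tensor([1.0, 0.0, 0.0])
y, jv = jvp(f, (x,), (v,))

# Reverse mode: pull a cotangent vector u back through the graph.
# One pass yields u @ J(x), i.e. one row of the Jacobian per output direction.
u = torch.tensor([1.0, 0.0])
y, vjp_fn = vjp(f, x)
(uJ,) = vjp_fn(u)
```
Recovering the full Jacobian therefore takes $N$ forward-mode passes or $M$ reverse-mode passes, which is why reverse mode is preferred when $N \gg M$ (many parameters, one scalar loss).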
The following DAG represents $g(x)=2 x^2-x+1$. The constants are shown in gray and crossed out as derivatives should not be propagated to constant operands.
![[DAG-auto-diff.png]]
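A minimal sketch of forward-mode propagation through this DAG, carrying each intermediate value together with its derivative with respect to $x$ (the node names `v1`–`v3` and the sample point are illustrative, not taken from the figure):
```python
# Forward-mode trace of g(x) = 2*x**2 - x + 1 at a sample point.
# Each node carries (value, derivative w.r.t. x); constants contribute no derivative.
x = 3.0
x_dot = 1.0                                # seed: d(x)/dx = 1

v1, v1_dot = x * x, 2 * x * x_dot          # v1 = x^2
v2, v2_dot = 2 * v1, 2 * v1_dot            # v2 = 2*x^2   (constant 2: no derivative propagated)
v3, v3_dot = v2 - x, v2_dot - x_dot        # v3 = 2*x^2 - x
g,  g_dot  = v3 + 1, v3_dot                # g  = 2*x^2 - x + 1 (constant 1: derivative unchanged)

print(g, g_dot)                            # 16.0 and 11.0; analytically g'(x) = 4*x - 1 = 11 at x = 3
```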
In PyTorch, one can implement custom autograd functions as follows:
```python
import torch
from torch.autograd import Function

class Sigmoid(Function):
    @staticmethod
    def forward(ctx, x):
        output = 1 / (1 + torch.exp(-x))
        # Save the output for the backward pass: sigma'(x) = sigma(x) * (1 - sigma(x))
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        grad_x = output * (1 - output) * grad_output
        return grad_x
```
```python
from torch.autograd import Function

class ReLU(Function):
    @staticmethod
    def forward(ctx, input):
        # Save the input: the gradient mask depends on where input < 0
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0   # zero gradient where the unit was inactive
        return grad_input
```
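A minimal usage sketch for the two custom `Function`s above (the tensor shapes are illustrative): they are invoked via `.apply`, and `torch.autograd.gradcheck` can compare the hand-written backward against a finite-difference approximation.
```python
import torch

# Double precision is recommended for gradcheck's finite-difference comparison.
x = torch.randn(5, dtype=torch.double, requires_grad=True)

y = Sigmoid.apply(x)            # custom Functions are called via .apply, not instantiated
y.sum().backward()              # populates x.grad using the backward() defined above

# Compare the hand-written backward against numerical differentiation.
print(torch.autograd.gradcheck(Sigmoid.apply, (x,)))

# Note: ReLU is not differentiable at 0, so gradcheck may complain if an input lands
# very close to zero; with random inputs this is unlikely.
z = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(ReLU.apply, (z,)))
```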