# Activation Functions
Activation functions introduce non-linearities into a model. Without them, any multi-layer network mathematically collapses into a single linear transformation, making depth meaningless.
To see why, consider what happens when you stack linear layers:
$
\begin{aligned}
y_{1} &= W_{1}x + b_{1} \\
y_{2} &= W_{2}y_{1} + b_{2} = W_{2}(W_{1}x + b_{1}) + b_{2} = (W_{2}W_{1})x + (W_{2}b_{1} + b_{2}) \\
y_{3} &= W_{3}y_{2} + b_{3} = W_{3}\big((W_{2}W_{1})x + (W_{2}b_{1} + b_{2})\big) + b_{3} = (W_{3}W_{2}W_{1})x + \dots
\end{aligned}
$
No matter how many layers you add, the weight matrices can always be multiplied out into one equivalent matrix $W_{\text{final}}$ (the product of all the individual weight matrices) and the biases combined into one equivalent bias $b_{\text{final}}$.
The entire deep network is then mathematically identical to $y = W_{\text{final}}x + b_{\text{final}}$.
So a 100-layer network without activation functions is just a single linear transformation in disguise, and its depth is completely meaningless. That's why activation functions are essential: they break this mathematical collapse and allow each layer to contribute genuine non-linear complexity.
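As a quick numerical check, here is a minimal NumPy sketch (not from the original note; shapes and names are arbitrary) showing that three stacked linear layers produce exactly the same output as the single collapsed affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three random "layers": weights W_i and biases b_i (dimensions chosen arbitrarily).
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(6, 5)), rng.normal(size=6)
W3, b3 = rng.normal(size=(3, 6)), rng.normal(size=3)

x = rng.normal(size=4)

# Forward pass through the stacked linear layers (no activation functions).
y = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# Collapse everything into one equivalent weight matrix and bias.
W_final = W3 @ W2 @ W1
b_final = W3 @ W2 @ b1 + W3 @ b2 + b3

# The deep "network" and the single affine map agree exactly (up to float error).
assert np.allclose(y, W_final @ x + b_final)
```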
## Sigmoid
$
h(x)=\sigma(x)=\frac{1}{1+e^{-x}}
$
$
\frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x))
$
![[Sigmoid Gradient.png]]
[[Derivative of Sigmoid]]
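A small sketch (assuming NumPy; the function names are my own) that evaluates the sigmoid and checks the identity $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$ against a finite-difference approximation:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 101)
eps = 1e-5
finite_diff = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(sigmoid_grad(x), finite_diff, atol=1e-7)

print(sigmoid_grad(0.0))  # gradient peaks at 0.25 when x = 0
```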
## Hyperbolic Tan
$
h(x)=\tanh (x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$
$
\frac{d}{dx} \tanh(x) = 1 - \tanh^{2}(x)
$
![[tanh derivative.png]]
[[Derivative of Tanh]]
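Analogously, a minimal NumPy sketch (function names are my own) checking $\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)$ against a finite difference:

```python
import numpy as np

def tanh_grad(x):
    """Derivative via the identity d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-4, 4, 101)
eps = 1e-5
finite_diff = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
assert np.allclose(tanh_grad(x), finite_diff, atol=1e-7)

print(tanh_grad(0.0))  # gradient peaks at 1.0 when x = 0
```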
## [[ReLU]] and variants
## [[Softmax]]