# Activation Functions

Activation functions introduce non-linearities into a model. Without them, any multi-layer network collapses into a single linear transformation, making depth meaningless.

To see why, consider what happens when you stack linear layers:

$ y_1 = W_1 x + b_1 $

$ y_2 = W_2 y_1 + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) $

$ y_3 = W_3 y_2 + b_3 = (W_3 W_2 W_1)x + (W_3 W_2 b_1 + W_3 b_2 + b_3) $

No matter how many layers you add, the weight matrices always multiply out into one equivalent matrix ($W_{\text{final}} = W_3 W_2 W_1$ in the three-layer case) and the biases combine into one equivalent bias $b_{\text{final}}$, so the whole network is mathematically identical to

$ y = W_{\text{final}}\, x + b_{\text{final}} $

A 100-layer network without activation functions is therefore just a single linear transformation in disguise. Activation functions break this collapse and let each layer contribute genuine complexity (a NumPy sketch at the end of this note checks the collapse numerically).

## Sigmoid

$ h(x)=\sigma(x)=\frac{1}{1+e^{-x}} $

$ \frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x)) $

![[Sigmoid Gradient.png]]

[[Derivative of Sigmoid]]

## Hyperbolic Tan

$ h(x)=\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $

$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $

![[tanh derivative.png]]

[[Derivative of Tanh]]

## [[ReLU]] and variants

## [[Softmax]]
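
## Numerical sketches

A minimal NumPy sketch of the linear-collapse argument above. The layer sizes, random seed, and use of `tanh` as the non-linearity are arbitrary assumptions for illustration, not part of any particular architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three arbitrary linear layers (sizes chosen only for illustration)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W3, b3 = rng.standard_normal((2, 5)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Stacked linear layers, applied one after another
y_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# Collapsed equivalent: one weight matrix and one bias
W_final = W3 @ W2 @ W1
b_final = W3 @ W2 @ b1 + W3 @ b2 + b3
y_collapsed = W_final @ x + b_final

print(np.allclose(y_stacked, y_collapsed))  # True: the depth added nothing

# Inserting a non-linearity (here tanh) between layers breaks the collapse
y_nonlinear = W3 @ np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2) + b3
print(np.allclose(y_nonlinear, y_collapsed))  # False in general
```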
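
As a quick sanity check on the two derivative identities above (sigmoid and tanh), the sketch below compares the closed-form gradients against central finite differences; the test points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigma(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def finite_diff(f, x, eps=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.linspace(-4, 4, 9)  # arbitrary test points
print(np.allclose(sigmoid_grad(x), finite_diff(sigmoid, x)))  # True
print(np.allclose(tanh_grad(x), finite_diff(np.tanh, x)))     # True
```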