# ReLU
The Rectified Linear Unit (ReLU) is one of the most successful [[Activation Functions]]. It is defined as:
$$
h(x)=\operatorname{ReLU}(x)=\max(0, x)
$$
$$
\frac{d}{dx}\operatorname{ReLU}(x)= \begin{cases}0 & \text{if } x<0 \\ 1 & \text{if } x>0 \\ \text{undefined} & \text{if } x=0 \end{cases}
$$
![[ReLU and Swish.png]]
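A minimal NumPy sketch of the forward pass and the subgradient typically used in practice (the value at $x=0$ is a convention; function names here are my own):
```python
import numpy as np

def relu(x):
    # Elementwise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 0 for x < 0, 1 for x > 0; the value at x == 0
    # is a convention (0 here), since the derivative is undefined there.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```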
Sigmoid and tanh, although biologically inspired, are less popular in deep neural network architectures because they saturate for large positive and negative inputs. In deep models the gradients at those inputs become very small, and the effect compounds layer by layer, causing vanishing gradients and numerical instability as depth increases.
ReLU avoids this problem by being linear in the positive range, so its gradients do not saturate there. Moreover, its gradient in that range is exactly 1, so when it is multiplied with other gradients during backpropagation it neither explodes nor diminishes quickly, helping with the classic [[Vanishing and Exploding Gradients]] problem.
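A rough NumPy illustration of the contrast (the numbers are only indicative; weights are ignored and only the activation's local derivative is multiplied across layers):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 30
x = 3.0  # a moderately large pre-activation

# Multiply each activation's local derivative at x across `depth` layers,
# ignoring weights, just to isolate the activation's contribution.
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # ~0.045 for x = 3
relu_grad = 1.0                                  # ReLU is linear for x > 0

print(sigmoid_grad ** depth)  # vanishes (on the order of 1e-40)
print(relu_grad ** depth)     # stays exactly 1.0
```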
Several minor modifications of ReLU are widely used in large models these days, for example Swish and GELU.
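For reference, minimal NumPy sketches of two of these variants: Swish (also called SiLU for $\beta=1$) and the common tanh approximation of GELU:
```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 is also known as SiLU.
    return x / (1.0 + np.exp(-beta * x))

def gelu(x):
    # The common tanh approximation of GELU = x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(swish(x))
print(gelu(x))
```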
Additional interesting things about ReLU:
**Dead Neurons Problem:**
- ReLU neurons "die" when they always receive negative inputs, causing them to always output 0
- Dead neurons have zero gradients, so they can never recover or learn anything
- If your data isn't centered and pre-activations end up heavily skewed negative, entire layers of ReLU neurons might die (a toy sketch follows this list)
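A toy NumPy sketch of a dead neuron (the batch, weights, and bias are made up purely for illustration): a large negative bias keeps every pre-activation below zero, so both the output and the local gradient are identically zero and nothing can ever update the neuron.
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))   # a batch of centered inputs
w = rng.normal(size=16) * 0.1     # small random weights
b = -10.0                         # a large negative bias "kills" this neuron

z = x @ w + b                     # pre-activations: all far below zero
a = np.maximum(0.0, z)            # ReLU output
grad_z = (z > 0).astype(float)    # local ReLU gradient

print(a.max())        # 0.0 -> the neuron never fires
print(grad_z.sum())   # 0.0 -> no gradient ever reaches w or b
```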
**Activation Diversity:**
1. With centered data, you get a healthy mix of active/inactive neurons from the start.
2. Active neurons learn from positive examples, while inactive neurons are "ready to learn" when they encounter the right patterns.
3. This creates a richer gradient landscape and better feature learning.
**Symmetry Breaking:**
- Without centering, if all neurons in a layer are pushed into the same activation regime by similar inputs, they tend to learn redundant, near-identical features (symmetry)
- Centering around zero creates natural asymmetry: some neurons get positive inputs (activate), others get negative inputs (stay silent)
- This initial diversity in activation patterns forces different neurons to specialize in different aspects of the data.
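A small NumPy experiment sketching the two lists above (the layer sizes and offset are arbitrary): with centered inputs each unit fires on roughly half of the examples, while with a large uncentered offset most units polarize into always-off (dead) or always-on (effectively linear), losing activation diversity.
```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 32, 256
W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # He-style init

x_centered = rng.normal(size=(2000, n_in))   # zero-mean inputs
x_shifted = x_centered + 5.0                 # same data, uncentered by a large offset

def on_rates(x):
    # Fraction of examples on which each ReLU unit fires.
    return (x @ W.T > 0).mean(axis=0)

for name, x in [("centered", x_centered), ("shifted", x_shifted)]:
    r = on_rates(x)
    print(f"{name}: mean on-rate {r.mean():.2f}, "
          f"dead units {(r == 0).mean():.2f}, always-on {(r == 1).mean():.2f}")
```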
This is why techniques like [[Weight Initialization|Xavier/He initialization]] are specifically designed to maintain this balance!
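As a sketch, He initialization scales the weight variance by $2/\text{fan-in}$ to compensate for ReLU zeroing out roughly half of the pre-activations (the helper name here is mine):
```python
import numpy as np

def he_init(n_in, n_out, rng=None):
    # He initialization: zero-mean Gaussian weights with variance 2 / fan_in,
    # chosen so that ReLU (which zeroes roughly half the pre-activations)
    # keeps the activation variance roughly constant across layers.
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_init(256, 128)
print(W.std())  # ~ sqrt(2 / 256) ≈ 0.088
```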