# Convolutional Neural Networks (CNN)

## What makes images spatial?

What makes images different?
- Spatial dimensions (2D projection of the 3D world)
- Extremely high dimensional input (1920x1080x3 ≈ 6M values!)
- Input dimensions are heavily correlated (we can't perceive even small shifts, even though the values change drastically)
- Cropping/shifting/occluding dimensions still yields an image
	- Possibly with the same semantics
- Basic natural image statistics are the same

Convolutional neural networks
- Add inductive bias to neural networks to deal with spatial signals
- Use convolutional filters to encode spatial structure
- Use local connectivity, parameter sharing, and translation equivariance to cope with the huge input dimensionality
- Use spatial pooling to remain robust to local variations

Why spatial?
- Images are 2D (or 3D if you count the channels)
- What does 2D input really mean? Neighboring variables are locally correlated

Learnable filters
- Image processing has many handcrafted filters. But are they optimal for recognition?
- Can we learn optimal filters from our data instead? Will they resemble the handcrafted filters?

Hypothesis (actually works out!)
- Image statistics are not location dependent
	- Natural images are stationary
- The same filters should work similarly on every corner of the image
- Perhaps move and reuse the same (red, yellow, green) filter across the whole image?

## Convolution

A [[Convolution]] is an operator that "blends" two functions. For images, we define convolution as a filter sliding over the image:

$I(x, y) * h=\sum_{i=-a}^{a} \sum_{j=-b}^{b} \overbrace{I(x-i, y-j) \cdot h(i, j)}^{\text {Inner product }}$

![[conv-cross-auto.jpg]]

The dimensions after convolution (a code sketch of this appears after the Pooling section below):

![[conv-output-dims.jpg]]

### Local Connectivity

Local connectivity implies:
- Weights are shared spatially (after translation)
- Connections are local along the surface (height and width)
- Connections are global along the depth (across channels)

In MLPs (fully connected), there is no local connectivity: everything is connected to everything, with no notion of "space", surface, or depth. Shuffling the pixels makes no difference to the output.

But local connectivity doesn't necessarily mean convolutional filters.

![[local-but-nonsharing.jpg]]

Local connectivity without parameter sharing (as above) can sometimes be useful where location is important, e.g. in medical scans.

### Effect on spatial dimensions

With successive convolutions, the spatial dimensions shrink. The common solution is to use zero padding around the image. This is not a big problem, as the corners and edges of images do not carry much information.

### Good practices

- Resize the image to a size that is a power of 2
- Stride s=1
- A 3x3 filter works well with deep architectures
- Add 1 layer of zero padding
- Avoid hyperparameters that do not "click", e.g. stride 2 with a 6x6 filter
- No dropout: the filters are much sparser, so there is little room for co-dependencies and overfitting

## Pooling

Pooling aggregates multiple values into a single value.
- Brings invariance to small transformations
- Reduces the feature map size, leading to faster computation
- Keeps the most important information for the next layer

Max pooling:

$\frac{\partial a_{rc}}{\partial x_{ij}}=\left\{\begin{array}{cc}1, & \text{if } i=i_{\max},\ j=j_{\max} \\ 0, & \text{otherwise}\end{array}\right.$

Average pooling:

$\frac{\partial a_{rc}}{\partial x_{ij}}=\frac{1}{r \cdot c}$
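To make the shapes concrete, here is a minimal NumPy sketch of a single-channel convolution, implemented as cross-correlation the way most deep learning libraries do it, with zero padding and stride. The function name `conv2d` and the toy sizes are illustrative assumptions, not from the lecture.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Single-channel 2D convolution (cross-correlation, as in CNN libraries)."""
    H, W = image.shape
    F = kernel.shape[0]  # assume a square F x F filter
    x = np.pad(image, pad)  # zero padding around the image, as discussed above
    # Output size: floor((H + 2*pad - F) / stride) + 1 (same formula for width)
    out_h = (H + 2 * pad - F) // stride + 1
    out_w = (W + 2 * pad - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Inner product of the filter with one local patch; the same
            # weights are reused at every location (parameter sharing)
            patch = x[r * stride : r * stride + F, c * stride : c * stride + F]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.random.rand(32, 32)   # 32 is a power of 2, per the good practices
kernel = np.random.rand(3, 3)
# 3x3 filter, stride 1, one layer of zero padding: spatial size is preserved
assert conv2d(image, kernel, stride=1, pad=1).shape == (32, 32)
# Without padding, each convolution shrinks the feature map by F - 1 = 2
assert conv2d(image, kernel, stride=1, pad=0).shape == (30, 30)
```

The asserts show why the 3x3 / stride 1 / pad 1 combination from the good practices list is convenient: the spatial size stays fixed no matter how many layers are stacked.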
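And a sketch of max pooling together with the gradient rule above: each upstream gradient flows only to the argmax pixel of its window. The helper names (`maxpool2d`, `maxpool2d_backward`) are hypothetical, and this assumes non-overlapping windows.

```python
import numpy as np

def maxpool2d(x, k=2):
    """Non-overlapping k x k max pooling; also returns the argmax of each
    window, which is all the backward pass needs."""
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    argmax = np.zeros((H // k, W // k, 2), dtype=int)
    for r in range(H // k):
        for c in range(W // k):
            window = x[r * k:(r + 1) * k, c * k:(c + 1) * k]
            i, j = np.unravel_index(np.argmax(window), window.shape)
            out[r, c] = window[i, j]
            argmax[r, c] = (r * k + i, c * k + j)
    return out, argmax

def maxpool2d_backward(grad_out, argmax, input_shape):
    """Per the formula above, the gradient is 1 at (i_max, j_max) of each
    window and 0 elsewhere, so each upstream gradient is routed to exactly
    one input pixel. (For average pooling it would instead be spread
    uniformly, 1/(k*k), over the whole window.)"""
    grad_in = np.zeros(input_shape)
    for r in range(grad_out.shape[0]):
        for c in range(grad_out.shape[1]):
            i, j = argmax[r, c]
            grad_in[i, j] += grad_out[r, c]
    return grad_in

x = np.random.rand(4, 4)
out, argmax = maxpool2d(x, k=2)                       # 4x4 -> 2x2 feature map
grad_in = maxpool2d_backward(np.ones((2, 2)), argmax, x.shape)
assert grad_in.sum() == 4                             # one nonzero entry per window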
## Invariance properties

Translation invariance
- The representations are not too different when images are shifted.

Scale invariance
- CNNs are scale invariant to some degree. This is not because the convolutional filters are scale invariant, but because of the scale variations present in the data.

Rotation invariance
- CNNs are not rotation invariant. Data augmentation can help.

We can use data augmentation to learn invariances, but the ideal way is to use architectures that are invariant by design, such as [[Group Equivariant Convolutional Neural Networks]].

---

## References

1. Lecture 4, UvA DL course 2020