# Convolutional Neural Networks (CNN)

## What makes images spatial?

What makes images different?
- Spatial dimensions (2D projection of the 3D world)
- Extremely high dimensional input (1920x1080x3 ≈ 6M values!)
- Input dimensions are heavily correlated (we can't perceive even small shifts, even though the values change drastically)
- Cropping/shifting/occluding dimensions still yields an image
	- Possibly with the same semantics
- Basic natural image statistics are the same

Convolutional neural networks
- Add inductive bias to neural networks to deal with spatial signals
- Use convolutional filters to encode spatial structure
- Use local connectivity, parameter sharing, and translation equivariance to cope with the huge input dimensionality
- Use spatial pooling to remain robust to local variations

Why spatial?
- Images are 2D (or 3D if you count the channels)
- What does 2D input really mean? Neighboring variables are locally correlated

Learnable filters
- Image processing has many handcrafted filters. But are they optimal for recognition?
- Can we learn optimal filters from our data instead? Will they resemble the handcrafted filters?

Hypothesis (actually works out!)
- Image statistics are not location dependent
	- Natural images are stationary
- The same filters should work similarly on every corner of the image
- Perhaps move and reuse the same (red, yellow, green) filter across the whole image?

## Convolution

A [[Convolution]] is an operator that "blends" two functions. For images, we define convolution as a filter sliding over the image:

$I(x, y) * h=\sum_{i=-a}^{a} \sum_{j=-b}^{b} \overbrace{I(x-i, y-j) \cdot h(i, j)}^{\text {Inner product }}$

![[conv-cross-auto.jpg]]

The dimensions after convolution (a code sketch of this appears after the Pooling section below):

![[conv-output-dims.jpg]]

### Local Connectivity

Local connectivity implies:
- Weights are shared spatially (after translation)
- Connections are local along the surface (height and width)
- Connections are global along the depth (across channels)

In MLPs (fully connected), there is no local connectivity: everything is connected to everything, with no notion of "space", surface, or depth. Shuffling the pixels makes no difference to the output.

But local connectivity doesn't necessarily mean convolutional filters.

![[local-but-nonsharing.jpg]]

Local connectivity without parameter sharing (as above) can sometimes be useful where location is important, e.g. in medical scans.

### Effect on spatial dimensions

With successive convolutions, the spatial dimensions shrink. The common solution is to use zero padding around the image. This is not a big problem, as the corners and edges of images do not carry much information.

### Good practices

- Resize the image to a size that is a power of 2
- Stride s=1
- A 3x3 filter works well with deep architectures
- Add 1 layer of zero padding
- Avoid hyperparameters that do not "click", e.g. stride 2 with a 6x6 filter
- No dropout: the filters are much sparser, so there is little room for co-dependencies and overfitting

## Pooling

Pooling aggregates multiple values into a single value.
- Brings invariance to small transformations
- Reduces the feature map size, leading to faster computation
- Keeps the most important information for the next layer

Max pooling:

$\frac{\partial a_{rc}}{\partial x_{ij}}=\left\{\begin{array}{cc}1, & \text{if } i=i_{\max},\ j=j_{\max} \\ 0, & \text{otherwise}\end{array}\right.$

Average pooling:

$\frac{\partial a_{rc}}{\partial x_{ij}}=\frac{1}{r \cdot c}$
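To make the shapes concrete, here is a minimal NumPy sketch of a single-channel convolution, implemented as cross-correlation the way most deep learning libraries do it, with zero padding and stride. The function name `conv2d` and the toy sizes are illustrative assumptions, not from the lecture.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Single-channel 2D convolution (cross-correlation, as in CNN libraries)."""
    H, W = image.shape
    F = kernel.shape[0]  # assume a square F x F filter
    x = np.pad(image, pad)  # zero padding around the image, as discussed above
    # Output size: floor((H + 2*pad - F) / stride) + 1 (same formula for width)
    out_h = (H + 2 * pad - F) // stride + 1
    out_w = (W + 2 * pad - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Inner product of the filter with one local patch; the same
            # weights are reused at every location (parameter sharing)
            patch = x[r * stride : r * stride + F, c * stride : c * stride + F]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.random.rand(32, 32)   # 32 is a power of 2, per the good practices
kernel = np.random.rand(3, 3)
# 3x3 filter, stride 1, one layer of zero padding: spatial size is preserved
assert conv2d(image, kernel, stride=1, pad=1).shape == (32, 32)
# Without padding, each convolution shrinks the feature map by F - 1 = 2
assert conv2d(image, kernel, stride=1, pad=0).shape == (30, 30)
```

The asserts show why the 3x3 / stride 1 / pad 1 combination from the good practices list is convenient: the spatial size stays fixed no matter how many layers are stacked.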
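And a sketch of max pooling together with the gradient rule above: each upstream gradient flows only to the argmax pixel of its window. The helper names (`maxpool2d`, `maxpool2d_backward`) are hypothetical, and this assumes non-overlapping windows.

```python
import numpy as np

def maxpool2d(x, k=2):
    """Non-overlapping k x k max pooling; also returns the argmax of each
    window, which is all the backward pass needs."""
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    argmax = np.zeros((H // k, W // k, 2), dtype=int)
    for r in range(H // k):
        for c in range(W // k):
            window = x[r * k:(r + 1) * k, c * k:(c + 1) * k]
            i, j = np.unravel_index(np.argmax(window), window.shape)
            out[r, c] = window[i, j]
            argmax[r, c] = (r * k + i, c * k + j)
    return out, argmax

def maxpool2d_backward(grad_out, argmax, input_shape):
    """Per the formula above, the gradient is 1 at (i_max, j_max) of each
    window and 0 elsewhere, so each upstream gradient is routed to exactly
    one input pixel. (For average pooling it would instead be spread
    uniformly, 1/(k*k), over the whole window.)"""
    grad_in = np.zeros(input_shape)
    for r in range(grad_out.shape[0]):
        for c in range(grad_out.shape[1]):
            i, j = argmax[r, c]
            grad_in[i, j] += grad_out[r, c]
    return grad_in

x = np.random.rand(4, 4)
out, argmax = maxpool2d(x, k=2)                       # 4x4 -> 2x2 feature map
grad_in = maxpool2d_backward(np.ones((2, 2)), argmax, x.shape)
assert grad_in.sum() == 4                             # one nonzero entry per window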
## Invariance properties

Translation invariance
- The representations are not too different when images are shifted.

Scale invariance
- CNNs are scale invariant to some degree. This is not because the convolutional filters are scale invariant, but because of the scale variations present in the data.

Rotation invariance
- CNNs are not rotation invariant. Data augmentation can help.

We can use data augmentation to learn invariances, but the ideal way is to use architectures that are invariant by design, such as [[Group Equivariant Convolutional Neural Networks]].

---

## References

1. Lecture 4, UvA DL course 2020