# Neural Networks

Models with fixed basis functions have useful analytical and computational properties, but their practical applicability is limited by the curse of dimensionality. An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive, i.e., to have parametric forms whose parameter values are adapted during training. The most successful models of this type are Neural Networks, also called the _multilayer perceptron_ (although NNs use continuous non-linearities instead of the step nonlinearity of the [[Perceptron]]). The price paid for this flexibility is that the likelihood function, needed for network training, is no longer a convex function of the model parameters. They are biologically inspired models with a long and rich [[History of Neural Networks|history]], and are responsible for the [[Deep Learning]] revolution of the 2010s.

## Feed-forward Neural Networks

Recall that generalized linear models for regression and classification take the form
$
y(\mathbf{x}, \mathbf{w})=f\left(\mathbf{w}^{T} \phi(\mathbf{x})\right)
$
where $f$ is a nonlinear activation function for classification and the identity for regression. Neural networks apply the same generalized form to the basis functions themselves, thereby creating flexible non-linear features that are learned from the data:
$
\phi_{m}\left(\mathbf{x}, \mathbf{w}_{m}^{(1)}\right)=h\left(\left(\mathbf{w}_{m}^{(1)}\right)^{T} \mathbf{x}\right)=h\left(\sum_{d=0}^{D} w_{m d}^{(1)} x_{d}\right)
$
where $m = 1, \dots, M$ indexes the basis functions and the superscript of $\mathbf{w}$ denotes the layer. A two-layer neural network for regression is then
$
y\left(\mathbf{x}, \mathbf{W}^{(1)}, \mathbf{w}^{(2)}\right)=\sum_{m=0}^{M} w_{m}^{(2)} h\left(\sum_{d=0}^{D} w_{m d}^{(1)} x_{d}\right) = \mathbf{w}^{(2)T} h\left(\mathbf{W}^{(1)} \mathbf{x}\right)
$
and for classification it takes the same form with a nonlinearity $f$ applied to the outputs:
$
y\left(\mathbf{x}, \mathbf{W}^{(1)}, \mathbf{w}^{(2)}\right)=f\left(\mathbf{w}^{(2)T} h\left(\mathbf{W}^{(1)} \mathbf{x}\right)\right)
$
In general, the common notation is
$
\begin{align}
\mathbf{x} &= [1, x_1,...,x_D] \in \mathbb{R}^{D+1} \\
\mathbf{a} &= [a_1,a_2,...,a_M] = \mathbf{W}^{(1)}\mathbf{x} \in \mathbb{R}^M \\
\mathbf{z} &= [z_1,z_2,...,z_M] = h(\mathbf{a}) \in \mathbb{R}^M \\
\mathbf{y} &= [y_1,y_2,...,y_K] = \mathbf{W}^{(2)}\mathbf{z} \in \mathbb{R}^K
\end{align}
$
where $\mathbf{a}$ are the pre-activations and $\mathbf{z}$ are the hidden units.
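To make the two-layer forward pass concrete, here is a minimal NumPy sketch assuming a $\tanh$ hidden nonlinearity $h$ and a sigmoid output $f$ for binary classification (identity for regression); the names (`forward`, `W1`, `w2`) and the choice of activations are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

def forward(x, W1, w2, task="regression"):
    """Minimal two-layer forward pass: y(x) = f(w2^T h(W1 x)).

    x  : (D+1,) input with x[0] = 1 acting as the bias term
    W1 : (M, D+1) first-layer weights, rows are w_m^(1)
    w2 : (M+1,) second-layer weights, w2[0] multiplies a fixed bias unit z_0 = 1
    """
    a = W1 @ x                                   # a_m = sum_d w_md^(1) x_d
    z = np.concatenate(([1.0], np.tanh(a)))      # z_0 = 1 (bias), z_m = h(a_m)
    y = w2 @ z                                   # output activation
    if task == "classification":
        y = 1.0 / (1.0 + np.exp(-y))             # f = sigmoid; identity for regression
    return y

# Tiny usage example with random weights (illustrative only)
rng = np.random.default_rng(0)
D, M = 3, 5
x = np.concatenate(([1.0], rng.normal(size=D)))  # prepend the bias input x_0 = 1
W1 = rng.normal(scale=0.5, size=(M, D + 1))
w2 = rng.normal(scale=0.5, size=M + 1)
print(forward(x, W1, w2))                        # regression: unbounded real value
print(forward(x, W1, w2, task="classification")) # sigmoid output in (0, 1)
```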
## Other Variants

### Skip Connections

Feed-forward networks can have explicit skip connections, as shown in the figure below. In fact, the only restriction in a _feed-forward_ architecture is that there must be no closed directed cycles, which ensures the outputs are deterministic functions of the inputs. Skip connections have the desirable property of [[Depth and Trainability#Smoothening the loss surface]], which enables training deeper networks. Example: [[ResNet]]

![[nn with skip connections.jpg]]

### Sparse Connections

Similarly, feed-forward architectures can also have sparse connections. These types of architectures are highly parameter-efficient, and they also allow prior knowledge to be incorporated as structure, such as the weight sharing in [[Convolutional Neural Networks (CNN)]].

### Recurrent Connections

Recurrence is a mechanism that allows neural networks to be of arbitrary depth, but with parameters shared across the layers. [[Recurrent Neural Networks (RNN)]] exploit such recurrent structure. Recurrence enables neural networks to work with sequences and data with temporal dependencies.

![[unfolding-rnns.jpg]]
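As a rough illustration of this parameter sharing, the sketch below unrolls a simple recurrence over a length-$T$ sequence in NumPy: the same weight matrices are reused at every time step, so the unrolled network has depth $T$ with shared parameters. The names (`rnn_forward`, `W_x`, `W_h`, `w_out`) and the $\tanh$ update are assumptions for the example, not a specific RNN from the text.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, w_out, h0=None):
    """Unrolled recurrent forward pass with parameters shared across time steps.

    xs    : (T, D) input sequence
    W_x   : (H, D) input-to-hidden weights (same matrix at every step)
    W_h   : (H, H) hidden-to-hidden (recurrent) weights
    w_out : (H,)   hidden-to-output weights
    """
    T, _ = xs.shape
    H = W_h.shape[0]
    h = np.zeros(H) if h0 is None else h0
    ys = []
    for t in range(T):                         # unfolding the recurrence in time
        h = np.tanh(W_x @ xs[t] + W_h @ h)     # shared W_x, W_h at every "layer"/step
        ys.append(w_out @ h)                   # one output per time step
    return np.array(ys), h

# Usage: a length-4 sequence of 3-dimensional inputs (illustrative only)
rng = np.random.default_rng(1)
T, D, H = 4, 3, 6
ys, h_last = rnn_forward(rng.normal(size=(T, D)),
                         rng.normal(scale=0.3, size=(H, D)),
                         rng.normal(scale=0.3, size=(H, H)),
                         rng.normal(scale=0.3, size=H))
print(ys.shape)   # (4,) -- one output per time step
```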