# Prototypical Networks for Few-shot Learning

- Key idea: each class can be represented by the mean of its examples in a representation space learned by a neural network.
- Classification is then performed by simply finding the nearest class prototype in this representation space.
- Prototypical networks are simpler and more efficient than recent meta-learning algorithms, which makes them an appealing approach to few-shot and zero-shot learning.

## Model

- Prototypical networks compute an $M$-dimensional representation $\mathbf{c}_{k} \in \mathbb{R}^{M}$, or prototype, of each class through an embedding function $f_{\phi}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{M}$ with learnable parameters $\phi$. Each prototype is the mean vector of the embedded support points belonging to its class:

$$
\mathbf{c}_{k}=\frac{1}{\left|S_{k}\right|} \sum_{\left(\mathbf{x}_{i}, y_{i}\right) \in S_{k}} f_{\phi}\left(\mathbf{x}_{i}\right)
$$

- Given a distance function $d: \mathbb{R}^{M} \times \mathbb{R}^{M} \rightarrow[0,+\infty)$, prototypical networks produce a distribution over classes for a query point $\mathbf{x}$ based on a softmax over negative distances to the prototypes in the embedding space:

$$
p_{\phi}(y=k \mid \mathbf{x})=\frac{\exp \left(-d\left(f_{\phi}(\mathbf{x}), \mathbf{c}_{k}\right)\right)}{\sum_{k^{\prime}} \exp \left(-d\left(f_{\phi}(\mathbf{x}), \mathbf{c}_{k^{\prime}}\right)\right)}
$$

- Empirically, the choice of distance is vital: squared Euclidean distance performs far better than cosine distance. (A minimal code sketch of this classification rule appears after the next section.)

## Interpretation as a linear model

- When we use the squared Euclidean distance $d\left(\mathbf{z}, \mathbf{z}^{\prime}\right)=\left\|\mathbf{z}-\mathbf{z}^{\prime}\right\|^{2}$, the model is equivalent to a linear model with a particular parameterization. To see this, expand the term in the exponent:

$$
-\left\|f_{\phi}(\mathbf{x})-\mathbf{c}_{k}\right\|^{2}=-f_{\phi}(\mathbf{x})^{\top} f_{\phi}(\mathbf{x})+2 \mathbf{c}_{k}^{\top} f_{\phi}(\mathbf{x})-\mathbf{c}_{k}^{\top} \mathbf{c}_{k}
$$

- The first term is constant with respect to the class $k$, so it does not affect the softmax probabilities. We can write the remaining terms as a linear model as follows:

$$
2 \mathbf{c}_{k}^{\top} f_{\phi}(\mathbf{x})-\mathbf{c}_{k}^{\top} \mathbf{c}_{k}=\mathbf{w}_{k}^{\top} f_{\phi}(\mathbf{x})+b_{k}, \text { where } \mathbf{w}_{k}=2 \mathbf{c}_{k} \text { and } b_{k}=-\mathbf{c}_{k}^{\top} \mathbf{c}_{k}
$$
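The classification rule from the Model section is easy to implement directly. Below is a minimal NumPy sketch, not the authors' code: the embedding `f_phi` is stubbed out as a fixed random linear map (in practice it is a trained neural network), and the helper names and toy episode sizes are illustrative assumptions rather than anything from the paper.

```python
# Minimal sketch of the prototypical-network classification rule (NumPy).
# Assumption: f_phi is a placeholder linear map, not a trained network.
import numpy as np

rng = np.random.default_rng(0)

D, M = 16, 8                      # input dimension D, embedding dimension M
W_embed = rng.normal(size=(D, M)) # fixed random "embedding" weights

def f_phi(x):
    """Stand-in embedding f_phi: R^D -> R^M (here just a fixed linear map)."""
    return x @ W_embed

def prototypes(support_x, support_y, num_classes):
    """c_k = mean of the embedded support points belonging to class k."""
    z = f_phi(support_x)                                   # (N, M)
    return np.stack([z[support_y == k].mean(axis=0)        # (K, M)
                     for k in range(num_classes)])

def class_probabilities(query_x, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    z = f_phi(query_x)                                         # (Q, M)
    d2 = ((z[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (Q, K)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Toy 3-way, 5-shot episode with random "data".
K, shots, queries = 3, 5, 4
support_x = rng.normal(size=(K * shots, D))
support_y = np.repeat(np.arange(K), shots)
query_x = rng.normal(size=(queries, D))

protos = prototypes(support_x, support_y, K)
probs = class_probabilities(query_x, protos)
print(probs.shape, probs.sum(axis=1))   # (4, 3), each row sums to 1
```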
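As a sanity check on the linear-model interpretation above, the short sketch below (again an illustrative assumption, not from the paper) verifies numerically that the softmax over negative squared Euclidean distances matches the softmax of the linear model with $\mathbf{w}_{k}=2 \mathbf{c}_{k}$ and $b_{k}=-\mathbf{c}_{k}^{\top} \mathbf{c}_{k}$; the term $-f_{\phi}(\mathbf{x})^{\top} f_{\phi}(\mathbf{x})$ they differ by is constant across classes, so the softmax ignores it.

```python
# Numerical check: distance-based logits and the linear reparameterization
# give identical class probabilities (they differ only by a per-query constant).
import numpy as np

rng = np.random.default_rng(1)
M, K = 8, 3
z = rng.normal(size=M)          # embedded query f_phi(x)
C = rng.normal(size=(K, M))     # prototypes c_1, ..., c_K

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

p_dist = softmax(-((z - C) ** 2).sum(axis=1))   # softmax over -||z - c_k||^2
W, b = 2 * C, -(C * C).sum(axis=1)              # w_k = 2 c_k, b_k = -c_k^T c_k
p_lin = softmax(W @ z + b)                      # linear-model logits

print(np.allclose(p_dist, p_lin))               # True
```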
---

## References

1. [Prototypical Networks for Few-shot Learning](https://arxiv.org/pdf/1703.05175.pdf), Snell et al. 2017