# Attention Mechanism

The concept of attention incorporates the following question into the network architecture: how relevant is the $i$-th element in the sequence relative to the other elements in the same sequence?

Basic idea:

- don't try to learn one global representation for the source sequence (encoder)
- rather, learn context-sensitive token representations
- when generating a target token, dynamically combine the most relevant source representations

Attention can be thought of as a similarity matrix over the sequence elements. We are interested in which elements are most important for every element in the sequence, so we generate an attention vector for each element. We can also compute several such attention vectors per element and combine them (by concatenation and a linear projection), which is called multi-head attention (see below).

To add some flexibility, we introduce three abstract vectors, query, key and value, obtained by linear transformations of the input, i.e. by multiplying the element vector with three matrices $W^q$, $W^k$ and $W^v$. There is one query, key and value vector for each sequence element; these are used to compute the attention vectors for every element.

Query:

- The query is a feature vector that describes what we are looking for in the sequence, i.e. what we might want to pay attention to.
- The query of element $i$ is compared to every key to establish the weights for its own output $y_i$.

Keys:

- The key feature vector roughly describes what the element is “offering”, or when it might be important. The keys should be designed such that we can identify the elements we want to pay attention to based on the query.
- The key of element $j$ is compared to the queries to establish how much element $j$ contributes to each output.

Values:

- The value vector is the one we average over.
- It is used in the weighted sum to compute each output vector once the attention weights have been established.

The overall formulation is given as:

$$
\begin{aligned}
\mathbf{c} &= \sum_{i=1}^{n} \operatorname{sim}\left(W_{k} \mathbf{k}^{i}, W_{q} \mathbf{q}\right) W_{v} \mathbf{v}^{i} \\
W_{q} &\in \mathbb{R}^{n_{m} \times m_{q}}, \quad W_{k} \in \mathbb{R}^{n_{m} \times m_{k}}, \quad W_{v} \in \mathbb{R}^{n_{v} \times m_{v}}
\end{aligned}
$$

Next, we need to calculate a score that rates which elements we want to pay attention to. For this we specify a score function $f_{attn}$ (or $\operatorname{sim}$). The score function takes a query and a key as input and outputs the score/attention weight of that query-key pair. It is usually implemented by a simple similarity metric such as the inner product, or by a small MLP; the scores are then passed through a softmax to obtain a probability distribution over the elements. The attention mechanism is thus fully differentiable!

$$
\alpha_{i}=\frac{\exp \left(f_{\text{attn}}\left(\text{key}_{i}, \text{query}\right)\right)}{\sum_{j} \exp \left(f_{\text{attn}}\left(\text{key}_{j}, \text{query}\right)\right)}, \quad \text{out}=\sum_{i} \alpha_{i} \cdot \text{value}_{i}
$$

![[attention_example.svg]]
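As a minimal sketch of this computation for a single query, using the dot product as $f_{attn}$ (all tensor names and dimensions below are illustrative, not taken from the lecture):

```python
import torch
import torch.nn.functional as F

n, d_in, d_attn = 5, 8, 4            # sequence length, input dim, projection dim
x = torch.randn(n, d_in)             # one feature vector per sequence element
x_query = torch.randn(d_in)          # the element we compute attention for

# Linear maps W^q, W^k, W^v (random here, learned in practice)
W_q = torch.randn(d_attn, d_in)
W_k = torch.randn(d_attn, d_in)
W_v = torch.randn(d_attn, d_in)

q = W_q @ x_query                    # query vector,          shape (d_attn,)
K = x @ W_k.T                        # one key per element,   shape (n, d_attn)
V = x @ W_v.T                        # one value per element, shape (n, d_attn)

scores = K @ q                       # f_attn = dot product,  shape (n,)
alpha = F.softmax(scores, dim=-1)    # attention weights, sum to 1
out = alpha @ V                      # weighted average of the values
```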
## Hard-Attention

Hard attention can be used in problems like word alignment, where a single source token should be selected instead of a distribution over tokens.

$$
p_{i}= \begin{cases}1, & \text{if } i=\underset{1 \leq j \leq n}{\arg \max } \dfrac{\exp \left(\operatorname{sim}\left(W_{k} \mathbf{k}^{j}, W_{q} \mathbf{q}\right)\right)}{\sum_{l=1}^{n} \exp \left(\operatorname{sim}\left(W_{k} \mathbf{k}^{l}, W_{q} \mathbf{q}\right)\right)} \\ 0, & \text{otherwise}\end{cases}
$$

Hard attention is problematic:

- it cannot model many-to-one relations, since exactly one element is selected: $\sum_{i=1}^{n} p_{i}=1$ with $p_i \in \{0, 1\}$
- more importantly, it is not differentiable (due to the argmax)

Both problems could be addressed by using an auxiliary loss criterion in addition to the target prediction loss.

## Self-Attention

The attention applied inside the Transformer architecture is called self-attention. In self-attention, each sequence element provides a key, a value, and a query. For each element, we perform an attention operation where, based on its query, we measure the similarity to all sequence elements' keys and return a different, averaged value vector for each element.

### Scaled Dot Product Attention

The input is a set of queries $Q \in \mathbb{R}^{T \times d_{k}}$, keys $K \in \mathbb{R}^{T \times d_{k}}$ and values $V \in \mathbb{R}^{T \times d_{v}}$, where $T$ is the sequence length, and $d_{k}$ and $d_{v}$ are the hidden dimensionalities of the queries/keys and the values respectively. The attention tensor is given as:

$$
\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
$$

Note that the softmax is sensitive to large input values, which saturate it and kill the gradient. The dot product of two $d_k$-dimensional vectors with unit-variance entries has variance $d_k$ (magnitude on the order of $\sqrt{d_k}$), so dividing by $\sqrt{d_{k}}$ keeps the logits at unit scale and prevents the softmax inputs from growing too large.

![[scaled_dot_product_attn.svg]]

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    # Pairwise query-key similarities, scaled by sqrt(d_k)
    attn_logits = torch.matmul(q, k.transpose(-2, -1))
    attn_logits = attn_logits / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very large negative logit -> ~0 weight after softmax
        attn_logits = attn_logits.masked_fill(mask == 0, -9e15)
    attention = F.softmax(attn_logits, dim=-1)   # attention weights per query
    values = torch.matmul(attention, v)          # weighted average of the values
    return values, attention
```

### Multi-Head Attention

To account for the fact that an element in a sequence can have multiple interpretations or different relations to its neighbors, we can combine several attention mechanisms into "multi-head" attention: we generate and use multiple sets of Q, K, V vectors, one per head. To get the final output, the individual heads' attention outputs are concatenated and multiplied with $W^o$; the result is then fed to the fully connected layer (see the sketch after the list below).

Multi-head attention improves the attention layer in the following ways:

1. It expands the ability to focus on different positions.
2. It gives the attention layer multiple "representation subspaces".
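A minimal sketch of such a layer, built on top of the `scaled_dot_product` function defined above; the joint QKV projection and the names `input_dim`, `embed_dim`, `num_heads` are illustrative implementation choices, not fixed by these notes:

```python
import torch
import torch.nn as nn


class MultiheadAttention(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # One joint projection produces Q, K and V for all heads at once
        self.qkv_proj = nn.Linear(input_dim, 3 * embed_dim)
        # W^o: mixes the concatenated head outputs
        self.o_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()
        qkv = self.qkv_proj(x)                                   # (batch, seq_len, 3 * embed_dim)
        # Split into heads: (batch, num_heads, seq_len, 3 * head_dim)
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)
        q, k, v = qkv.chunk(3, dim=-1)

        # Scaled dot product attention per head (function defined above)
        values, attention = scaled_dot_product(q, k, v, mask=mask)

        # Concatenate heads and apply the output projection W^o
        values = values.permute(0, 2, 1, 3).reshape(batch_size, seq_len, -1)
        return self.o_proj(values), attention
```

With, e.g., `MultiheadAttention(input_dim=512, embed_dim=512, num_heads=8)`, each head attends within its own 64-dimensional subspace, which is exactly the "representation subspaces" point above.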