# Graph Attention Networks (GAT)
As in the [[Graph Convolutional Networks (GCN)]], the graph attention layer creates a message for each node using a linear layer/weight matrix. For the [[Attention Mechanism]] part, it uses the message of the node itself as the query, and the messages to average over as both keys and values (note that this also includes the message of the node to itself).
![[graph-attention-MLP.jpg]]
[Image Credit](https://arxiv.org/abs/1710.10903)
The score function $f_{attn}$ is implemented as a one-layer MLP that maps the query and key to a single value. Here, $h_{i}$ and $h_{j}$ are the original features of nodes $i$ and $j$ respectively, so that $\mathbf{W} h_{i}$ and $\mathbf{W} h_{j}$ are the messages of the layer with $\mathbf{W}$ as the shared weight matrix. $\mathbf{a}$ is the weight matrix of the MLP, which has the shape $\left[1, 2 \times d_{\text{message}}\right]$, and $\alpha_{ij}$ is the final attention weight from node $i$ to $j$.
The calculation can be described as follows. Note that $\|$ is the concatenation operator and $\mathcal{N}_{i}$ denotes the indices of the neighbors of node $i$.
$$
\alpha_{i j}=\frac{\exp \left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W} h_{i} \| \mathbf{W} h_{j}\right]\right)\right)}{\sum_{k \in \mathcal{N}_{i}} \exp \left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W} h_{i} \| \mathbf{W} h_{k}\right]\right)\right)}
$$
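As a concrete example, the following PyTorch sketch computes the attention weights $\alpha_{ij}$ and the aggregated message for a single node $i$ with three neighbors. The tensor names, sizes, and the negative slope of 0.2 are illustrative assumptions, not values given in this note.

```python
import torch
import torch.nn.functional as F

d_in, d_message = 4, 8                   # illustrative feature/message sizes
W = torch.randn(d_message, d_in)         # shared weight matrix W
a = torch.randn(2 * d_message)           # attention weights a, shape [2 * d_message]

h_i = torch.randn(d_in)                  # original features of node i
h_j = torch.randn(3, d_in)               # features of the neighbors j in N(i), incl. i itself

m_i = W @ h_i                            # message of node i (acts as the query)
m_j = h_j @ W.T                          # messages of the neighbors (keys and values)

# e_ij = LeakyReLU(a [W h_i || W h_j]) for every neighbor j
e_ij = F.leaky_relu(torch.cat([m_i.expand(3, -1), m_j], dim=-1) @ a,
                    negative_slope=0.2)  # assumed slope, as in the GAT paper
alpha_ij = torch.softmax(e_ij, dim=0)    # normalize over the neighborhood N(i)

h_i_new = alpha_ij @ m_j                 # weighted average: sum_j alpha_ij * W h_j
```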
Note that we apply the LeakyReLU non-linearity before the softmax. Without it, the term involving $h_i$ would cancel out, making the attention weights independent of the node itself.
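To see why, drop the LeakyReLU and split $\mathbf{a}=\left[\mathbf{a}_{1} \,\|\, \mathbf{a}_{2}\right]$ into the halves acting on the query and key parts. The score then factorizes as $\mathbf{a}\left[\mathbf{W} h_{i} \| \mathbf{W} h_{j}\right]=\mathbf{a}_{1} \mathbf{W} h_{i}+\mathbf{a}_{2} \mathbf{W} h_{j}$, so that
$$
\alpha_{i j}=\frac{\exp \left(\mathbf{a}_{1} \mathbf{W} h_{i}\right) \exp \left(\mathbf{a}_{2} \mathbf{W} h_{j}\right)}{\sum_{k \in \mathcal{N}_{i}} \exp \left(\mathbf{a}_{1} \mathbf{W} h_{i}\right) \exp \left(\mathbf{a}_{2} \mathbf{W} h_{k}\right)}=\frac{\exp \left(\mathbf{a}_{2} \mathbf{W} h_{j}\right)}{\sum_{k \in \mathcal{N}_{i}} \exp \left(\mathbf{a}_{2} \mathbf{W} h_{k}\right)}
$$
and the contribution of $h_i$ cancels between numerator and denominator. The LeakyReLU breaks this factorization.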
The final output feature for each node is then given by,
$$
h_{i}^{\prime}=\sigma\left(\sum_{j \in \mathcal{N}_{i}} \alpha_{i j} \mathbf{W} h_{j}\right)
$$
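Putting the two equations together, below is a minimal single-head sketch of a graph attention layer operating on a dense adjacency matrix with self-loops. The class name, the dense masking trick, the simple initialization of $\mathbf{a}$, and the choice of ELU for $\sigma$ are assumptions for illustration; the reference implementation uses sparse operations and multiple attention heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer (dense, illustrative sketch)."""

    def __init__(self, d_in, d_message, negative_slope=0.2):
        super().__init__()
        self.W = nn.Linear(d_in, d_message, bias=False)          # shared weight matrix W
        self.a = nn.Parameter(torch.randn(2 * d_message) * 0.1)  # attention weights a
        self.negative_slope = negative_slope

    def forward(self, h, adj):
        # h:   [num_nodes, d_in] node features
        # adj: [num_nodes, num_nodes] adjacency matrix including self-loops (1 = edge)
        m = self.W(h)                                  # messages W h, [num_nodes, d_message]
        d_message = m.size(-1)
        # e_ij = LeakyReLU(a [W h_i || W h_j]), computed via the split a = [a_1 || a_2]
        e = (m @ self.a[:d_message]).unsqueeze(1) + (m @ self.a[d_message:]).unsqueeze(0)
        e = F.leaky_relu(e, negative_slope=self.negative_slope)
        # restrict the softmax to the actual neighbors N(i) of each node
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)               # [num_nodes, num_nodes]
        return F.elu(alpha @ m)                        # h'_i = sigma(sum_j alpha_ij W h_j)
```

Masking the non-edges with $-\infty$ before the softmax ensures that each row of $\alpha$ only distributes weight over $\mathcal{N}_i$, matching the normalization in the attention equation above.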
![[graph_attention.jpeg]]
[Image Credit](https://arxiv.org/abs/1710.10903)
---
## References
1. Lecture 7.5, UvA Deep Learning course 2020