# Graph Attention Networks (GAT)

Similarly to the [[Graph Convolutional Networks (GCN)]], the graph attention layer creates a message for each node using a linear layer/weight matrix. For the [[Attention Mechanism]] part, it uses the message from the node itself as the query, and the messages to be averaged as both keys and values (note that this also includes the message to the node itself).

![[graph-attention-MLP.jpg]]
[Image Credit](https://arxiv.org/abs/1710.10903)

The score function $f_{attn}$ is implemented as a one-layer MLP which maps the query and key to a single value. $h_{i}$ and $h_{j}$ are the original features of nodes $i$ and $j$ respectively, and $\mathbf{W} h_{i}$ and $\mathbf{W} h_{j}$ are the messages of the layer, with $\mathbf{W}$ as the weight matrix. $\mathbf{a}$ is the weight matrix of the MLP, which has the shape $\left[1, 2 \times d_{\text{message}}\right]$, and $\alpha_{ij}$ is the final attention weight from node $i$ to $j$. The calculation can be described as follows, where $\|$ is the concatenation operator and $\mathcal{N}_{i}$ the set of indices of the neighbors of node $i$:

$$
\alpha_{ij}=\frac{\exp \left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W} h_{i} \,\|\, \mathbf{W} h_{j}\right]\right)\right)}{\sum_{k \in \mathcal{N}_{i}} \exp \left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W} h_{i} \,\|\, \mathbf{W} h_{k}\right]\right)\right)}
$$

Note that we apply the LeakyReLU non-linearity before the softmax. Without it, $\mathbf{a}\left[\mathbf{W} h_{i} \| \mathbf{W} h_{j}\right]$ is linear in $\mathbf{W} h_{i}$, so the factor depending on $h_{i}$ appears in both the numerator and every term of the denominator and cancels out, making the attention independent of the node itself.

The final output feature for each node is then given by (a minimal PyTorch sketch of the full layer is given at the end of this note):

$$
h_{i}^{\prime}=\sigma\left(\sum_{j \in \mathcal{N}_{i}} \alpha_{ij} \mathbf{W} h_{j}\right)
$$

![[graph_attention.jpeg]]
[Image Credit](https://arxiv.org/abs/1710.10903)

---

## References

1. Lecture 7.5, UvA Deep Learning course 2020
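
---

To make the equations above concrete, here is a minimal sketch of a single-head graph attention layer, assuming PyTorch and a dense adjacency matrix with self-loops. The class and variable names are illustrative, not the paper's reference implementation; the LeakyReLU slope of 0.2 follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATLayer(nn.Module):
    """Single-head graph attention layer (dense-adjacency sketch)."""

    def __init__(self, d_in, d_message):
        super().__init__()
        self.W = nn.Linear(d_in, d_message, bias=False)    # message weight matrix W
        self.a = nn.Linear(2 * d_message, 1, bias=False)   # attention MLP a, shape [1, 2 * d_message]
        self.leaky_relu = nn.LeakyReLU(0.2)                 # slope 0.2 as in the paper

    def forward(self, h, adj):
        # h:   [N, d_in]  original node features
        # adj: [N, N]     adjacency matrix with self-loops (1 = edge, 0 = no edge)
        Wh = self.W(h)                                      # messages W h_i, [N, d_message]
        N = Wh.size(0)

        # All pairwise concatenations [W h_i || W h_j] -> raw scores e_ij, [N, N]
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)

        # Restrict the softmax to the neighborhood N_i by masking non-edges
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                    # attention weights alpha_ij

        # h'_i = sigma(sum_j alpha_ij W h_j); ELU used as the output non-linearity sigma
        return F.elu(alpha @ Wh)


# Tiny usage example: 4 nodes in a ring, plus self-loops
adj = torch.eye(4)
for i in range(4):
    adj[i, (i + 1) % 4] = adj[(i + 1) % 4, i] = 1.0
layer = GATLayer(d_in=8, d_message=16)
out = layer(torch.randn(4, 8), adj)                         # -> [4, 16]
```

For the multi-head attention used in the paper, several such layers would run in parallel and their outputs would be concatenated (or averaged in the final layer).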