# Text Graph Transformer for Document Classification
Authors: Zhang et al.
Date: EMNLP 2020
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.668.pdf
Video: https://slideslive.com/38938916/text-graph-transformer-for-document-classification
## The problem
- Document-word graphs capture global word co-occurrence in a corpus. Using these graphs with [[Graph Convolutional Networks (GCN)]] and [[Graph Attention Networks (GAT)]] has shown promise, but has two limitations:
- Not scalable to large corpora due to high computation and memory costs. GCN has time complexity $\mathcal{O}(Lnd^{2})$ and memory complexity $\mathcal{O}(Lnd + Ld^{2})$, where $L$ is the number of layers, $n$ the number of nodes, and $d$ the embedding dimension.
- They ignore the heterogeneity of the graph, which contains both document and word nodes.
- In standard supervised learning, the loss function decomposes over individual independent samples, which allows SGD with mini-batches.
- In a graph, nodes are interrelated via edges, so sampling can introduce bias; one needs to guarantee that the sampled subgraph maintains meaningful structure that the GNN can exploit.
- In popular architectures, full-batch gradient descent is therefore used, as the whole adjacency matrix and all node features are kept in memory.
## The solution
- Instead of learning on the full text graph, they propose a sampling method that allows subgraph mini-batch training.
- They distinguish the learning process of different node types (words and documents) to fully utilize the heterogeneity of the text graph.
![[TG-Transformer.png]]
## The details
- The Document-word Text Graph
- ![[Document-word graph.png]]
- Nodes
- Word nodes represent all the words in the corpus vocabulary.
- Document nodes represent all the documents in the corpus.
- Edges
- Word-document edges
- Word occurrence within documents
- Edge weights measured by [[TF-IDF]]
- Word-word edges
- Local word co-occurrence within a sliding window over the corpus
- Edge weights measured by [[Pointwise Mutual Information (PMI)]]
- $\operatorname{PMI}\left(w_{i}, w_{j}\right)=\log \frac{p_{i, j}}{p_{i} p_{j}}=\log \frac{N_{i, j} N}{N_{i} N_{j}}$, where $N_{i}, N_{j}, N_{i, j}$ are the numbers of sliding windows in the corpus that contain word $w_{i}$, word $w_{j}$, and both $w_{i}, w_{j}$, and $N$ is the total number of sliding windows in the corpus (a small sketch of this computation follows below).
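A minimal sketch, not from the paper's code, of how the PMI edge weights could be computed from sliding-window counts; the window size, tokenization, and the positive-PMI filter are illustrative assumptions:

```python
# Sketch: PMI weights for word-word edges from sliding-window co-occurrence counts.
from collections import Counter
from itertools import combinations
from math import log

def pmi_edges(docs, window=20):
    """docs: list of tokenized documents (lists of words)."""
    n_windows = 0
    word_count = Counter()   # N_i: number of windows containing word i
    pair_count = Counter()   # N_ij: number of windows containing both i and j
    for tokens in docs:
        windows = [tokens[k:k + window] for k in range(max(1, len(tokens) - window + 1))]
        n_windows += len(windows)
        for w in windows:
            uniq = set(w)
            word_count.update(uniq)
            pair_count.update(combinations(sorted(uniq), 2))
    edges = {}
    for (wi, wj), n_ij in pair_count.items():
        pmi = log(n_ij * n_windows / (word_count[wi] * word_count[wj]))
        if pmi > 0:          # keep only positive-PMI edges (assumption for this sketch)
            edges[(wi, wj)] = pmi
    return edges

# Toy example
docs = [["graph", "neural", "network", "for", "text"], ["text", "graph", "transformer"]]
print(pmi_edges(docs, window=3))
```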
- Sampling
- TG-Transformer is trained on subgraph mini-batches; the sampling is an unsupervised pre-processing step, which makes the model scalable to large corpora.
- Intimacy matrix $\mathbf{S}$ based on the PageRank algorithm: $\mathbf{S}=\alpha \cdot(\mathbf{I}-(1-\alpha) \cdot \overline{\mathbf{A}})^{-1}$, where the factor $\alpha \in[0,1]$ is usually set to $0.15$, $\overline{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $\mathbf{A}$ is the adjacency matrix of the text graph, and $\mathbf{D}$ is its corresponding diagonal degree matrix.
- For any document node $v_{i} \in \mathcal{V}$, sample its context subgraph $\mathcal{C}\left(v_{i}\right)$ of size $k$ by selecting its top $k$ intimate neighbour word nodes $u_{j} \in \mathcal{U}$.
- For a word node $u_{i} \in \mathcal{U}$, calculate the ratios of its two types of incident edges:
- $r_{w}\left(u_{i}\right)=\frac{\left|\mathcal{F}\left(u_{i}\right)\right|}{\left|\mathcal{F}\left(u_{i}\right)\right|+\left|\mathcal{E}\left(u_{i}\right)\right|}$ and $r_{d}\left(u_{i}\right)=\frac{\left|\mathcal{E}\left(u_{i}\right)\right|}{\left|\mathcal{F}\left(u_{i}\right)\right|+\left|\mathcal{E}\left(u_{i}\right)\right|}$
- where $\mathcal{F}\left(u_{i}\right), \mathcal{E}\left(u_{i}\right)$ are the sets of word-word and word-document edges incident to $u_{i}$ with intimacy score larger than a threshold $\theta$. Its context subgraph $\mathcal{C}\left(u_{i}\right)$ of size $k$ is sampled by selecting its top $k \cdot r_{w}\left(u_{i}\right)$ intimate neighbour word nodes and its top $k \cdot r_{d}\left(u_{i}\right)$ intimate neighbour document nodes, respectively (see the sketch below).
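A minimal sketch of the intimacy-based context subgraph sampling, assuming a small dense adjacency matrix with documents indexed before words; the budget rounding is simplified, and for large graphs the dense matrix inverse would be replaced by an approximation:

```python
# Sketch: intimacy matrix and top-k context subgraph sampling.
import numpy as np

def intimacy_matrix(A, alpha=0.15):
    """S = alpha * (I - (1 - alpha) * D^{-1/2} A D^{-1/2})^{-1}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    n = A.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)

def sample_doc_context(S, v, word_idx, k):
    """Top-k most intimate word neighbours of document node v."""
    scores = S[v, word_idx]
    return [word_idx[i] for i in np.argsort(-scores)[:k]]

def sample_word_context(A, S, u, doc_idx, word_idx, k, theta=0.0):
    """Split the budget k between word and document neighbours by edge-type ratio."""
    ww = [(v, S[u, v]) for v in word_idx if v != u and A[u, v] > 0 and S[u, v] > theta]
    wd = [(v, S[u, v]) for v in doc_idx if A[u, v] > 0 and S[u, v] > theta]
    if not ww and not wd:
        return []
    k_w = round(k * len(ww) / (len(ww) + len(wd)))   # ~ k * r_w(u)
    k_d = k - k_w                                    # ~ k * r_d(u)
    top = lambda pairs, m: [v for v, _ in sorted(pairs, key=lambda p: -p[1])[:m]]
    return top(ww, k_w) + top(wd, k_d)
```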
- Structural encodings
- Heterogeneity encoding
- 0 for document nodes, 1 for word nodes.
- WL Structural Encoding
- The Weisfeiler-Lehman (WL) algorithm labels nodes according to their structural roles in the graph, so nodes with the same iterated neighbourhood structure receive the same label.
- This label is used as a replacement for position in [[Positional Encoding]] and is given as $\left[\sin \left(\frac{\mathrm{WL}\left(v_{j}\right)}{10000^{\frac{2 l}{d_{h}}}}\right), \cos \left(\frac{\mathrm{WL}\left(v_{j}\right)}{10000^{\frac{2 l+1}{d_{h}}}}\right)\right]_{l=0}^{\left\lfloor\frac{d_{h}}{2}\right\rfloor}$
- These encodings have the same dimension ($d_h$) as the original raw feature embeddings, so they are all added together to get the initial node representation of the input subgraph, denoted $\mathbf{H}^{(\mathbf{0})}$ (a sketch of both encodings follows below).
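A minimal sketch of the two structural encodings, assuming an adjacency-dict graph, a fixed number of WL iterations with a simple compressed relabelling, and an even $d_h$; the heterogeneity flag handling is also an illustrative simplification:

```python
# Sketch: WL node labels and the sinusoidal encoding built from them.
import numpy as np

def wl_labels(adj, iters=2):
    """adj: {node: iterable of neighbours}. Returns an integer WL label per node."""
    labels = {v: 0 for v in adj}                      # start from a uniform label
    for _ in range(iters):
        sigs = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v]))) for v in adj}
        mapping = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        labels = {v: mapping[sigs[v]] for v in adj}   # compress signatures to integers
    return labels

def wl_positional_encoding(wl_label, d_h=300):
    """Sinusoidal encoding with the WL label used in place of the sequence position."""
    l = np.arange(d_h // 2)
    enc = np.empty(d_h)
    enc[0::2] = np.sin(wl_label / 10000 ** (2 * l / d_h))
    enc[1::2] = np.cos(wl_label / 10000 ** ((2 * l + 1) / d_h))
    return enc

# Heterogeneity encoding: a 0/1 flag (document vs. word node), mapped to d_h dimensions
# and added to the raw features and the WL encoding to form H^(0).
```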
- Graph Transformer Layer
- The graph [[Transformers]] layer learns the target node representation $\mathbf{H}^{(l)} = \operatorname{softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{h}}}\right) \mathbf{V}+\operatorname{G-Res}$, where $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are linear projections of $\mathbf{H}^{(l-1)}$.
- G-Res refers to a graph residual term (skip connection), which helps mitigate the over-smoothing issue of GNNs.
- Finally, to get the class label, a softmax classifier is applied to the averaged final representations: $\mathbf{z}=\operatorname{softmax}\left(\operatorname{average}\left(\mathbf{H}^{(D)}\right)\right)$ (a sketch of the layer and readout follows below).
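A minimal sketch of a single-head graph transformer layer and the classification readout, using a plain residual connection in place of the paper's G-Res term and random weights purely for illustration:

```python
# Sketch: one attention layer over the subgraph nodes plus average readout.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_transformer_layer(H, Wq, Wk, Wv):
    """H: (n_nodes, d_h) subgraph node representations."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_h = H.shape[1]
    attn = softmax(Q @ K.T / np.sqrt(d_h))   # attention over the sampled subgraph
    return attn @ V + H                       # "+ H" stands in for the G-Res term

def classify(H_final, W_out):
    """Average the final node representations and apply a softmax classifier."""
    return softmax(np.mean(H_final, axis=0) @ W_out)

# Toy example with random weights
rng = np.random.default_rng(0)
d_h, n_nodes, n_classes = 32, 6, 4
H0 = rng.normal(size=(n_nodes, d_h))
Wq, Wk, Wv = (rng.normal(size=(d_h, d_h)) for _ in range(3))
W_out = rng.normal(size=(d_h, n_classes))
H1 = graph_transformer_layer(H0, Wq, Wk, Wv)
H2 = graph_transformer_layer(H1, Wq, Wk, Wv)
print(classify(H2, W_out))
```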
- Implementation
- Node representation dimension is 300, initialized with GloVe embeddings
- 2-layer graph transformer with hidden size 32 and 4 attention heads
- dropout rate 0.5, learning rate 0.001, weight decay and early stopping with validation set
## The results
![[TG-Transformer Results.png]]
![[TG-Transformer training time.png]]
- Outperforms all graph baselines, with much lower memory and compute costs.
- The hyperparameter $k$, which controls the subgraph size in sampling, is important: increasing it improves performance steadily up to an ideal size.
- Ablation study
- The two structural encodings are important for capturing useful heterogeneous graph structure information
- Initialization with pre-trained embeddings is important
- A separate model for word and document nodes is beneficial.