# Text Graph Transformer for Document Classification

Authors: Zhang et al.
Date: EMNLP 2020
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.668.pdf
Video: https://slideslive.com/38938916/text-graph-transformer-for-document-classification

## The problem

- Document-word graphs capture global word co-occurrence in a corpus. Using these graphs with [[Graph Convolutional Networks (GCN)]] and [[Graph Attention Networks (GAT)]] has shown promise, but has two limitations:
    - Not scalable to large corpora due to high computation and memory costs: GCN has time complexity $\mathcal{O}\left(L n d^{2}\right)$ and memory complexity $\mathcal{O}\left(L n d+L d^{2}\right)$.
    - They ignore the heterogeneity of the graph, which contains both document and word nodes.
- In standard supervised learning, the loss function decomposes over individual, independent samples, which allows SGD with mini-batches.
- In a graph, nodes are interrelated via edges, so sampling can introduce bias; we need to guarantee that the sampled subgraph maintains meaningful structure that the GNN can exploit.
- In popular architectures, full-batch gradient descent is used because the whole adjacency matrix and all node features are kept in memory.

## The solution

- Instead of learning on the full text graph, they propose a sampling method that allows subgraph mini-batch training.
- Distinguish the learning process of the different node types to fully exploit the heterogeneity of the text graph (words and documents).

![[TG-Transformer.png]]

## The details

- The Document-word Text Graph
    - ![[Document-word graph.png]]
    - Nodes
        - Document nodes represent all documents in the corpus.
        - Word nodes represent all words in the corpus vocabulary.
    - Edges
        - Word-document edges
            - Word occurrence within documents
            - Edge weights measured by [[TF-IDF]]
        - Word-word edges
            - Local word co-occurrence within a sliding window over the corpus
            - Edge weights measured by [[Pointwise Mutual Information (PMI)]] (a counting sketch follows this list)
            - $\operatorname{PMI}\left(w_{i}, w_{j}\right)=\log \frac{p_{i, j}}{p_{i} p_{j}}=\log \frac{N_{i, j} N}{N_{i} N_{j}}$, where $N_{i}, N_{j}, N_{i, j}$ are the numbers of sliding windows in the corpus that contain word $w_{i}$, word $w_{j}$, and both $w_{i}$ and $w_{j}$, and $N$ is the total number of sliding windows in the corpus.
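A minimal sketch of how the PMI edge weights above can be computed from sliding-window counts, assuming pre-tokenized documents. The `pmi_edges` name, the `window_size` default, and keeping only positive-PMI pairs are illustrative assumptions, not details confirmed by the paper.

```python
from collections import Counter
from itertools import combinations
from math import log

def pmi_edges(docs, window_size=20):
    """Word-word edge weights via PMI over sliding windows.

    docs: list of tokenized documents (each a list of word strings).
    Returns {(w_i, w_j): PMI} for word pairs with positive PMI (assumption).
    """
    # Collect every sliding window over the corpus.
    windows = []
    for tokens in docs:
        if len(tokens) <= window_size:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window_size]
                           for i in range(len(tokens) - window_size + 1))

    # N_i: #windows containing word i; N_ij: #windows containing both i and j.
    n_i, n_ij = Counter(), Counter()
    for window in windows:
        uniq = sorted(set(window))
        n_i.update(uniq)
        n_ij.update(combinations(uniq, 2))

    n = len(windows)  # total number of sliding windows N
    edges = {}
    for (wi, wj), nij in n_ij.items():
        pmi = log(nij * n / (n_i[wi] * n_i[wj]))
        if pmi > 0:
            edges[(wi, wj)] = pmi
    return edges
```

The word-document edge weights would simply be the TF-IDF scores of each (word, document) pair.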
- Sampling
    - TG-Transformer is trained on subgraph mini-batches; the sampling is an unsupervised pre-processing step that makes training scalable to large corpora (a minimal numpy sketch appears at the end of this note).
    - Intimacy matrix $\mathbf{S}$ based on the PageRank algorithm: $\mathbf{S}=\alpha \cdot(\mathbf{I}-(1-\alpha) \cdot \overline{\mathbf{A}})^{-1}$, where the factor $\alpha \in[0,1]$ is usually set to $0.15$, $\overline{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $\mathbf{A}$ is the adjacency matrix of the text graph, and $\mathbf{D}$ is its diagonal degree matrix.
    - For any document node $v_{i} \in \mathcal{V}$, sample its context subgraph $\mathcal{C}\left(v_{i}\right)$ of size $k$ by selecting its top $k$ most intimate neighbour word nodes $u_{j} \in \mathcal{U}$.
    - For a word node $u_{i} \in \mathcal{U}$, calculate the ratios of its two types of incident edges:
        - $r_{w}\left(u_{i}\right)=\frac{\left|\mathcal{F}\left(u_{i}\right)\right|}{\left|\mathcal{F}\left(u_{i}\right)\right|+\left|\mathcal{E}\left(u_{i}\right)\right|}$ and $r_{d}\left(u_{i}\right)=\frac{\left|\mathcal{E}\left(u_{i}\right)\right|}{\left|\mathcal{F}\left(u_{i}\right)\right|+\left|\mathcal{E}\left(u_{i}\right)\right|}$,
        - where $\mathcal{F}\left(u_{i}\right)$ and $\mathcal{E}\left(u_{i}\right)$ are the sets of word-word and word-document edges incident to $u_{i}$ with intimacy score larger than a threshold $\theta$. Its context subgraph $\mathcal{C}\left(u_{i}\right)$ of size $k$ is sampled by selecting its top $k \cdot r_{w}\left(u_{i}\right)$ most intimate neighbour word nodes and its top $k \cdot r_{d}\left(u_{i}\right)$ most intimate neighbour document nodes, respectively.
- Structural encodings
    - Heterogeneity encoding: 0 for document nodes, 1 for word nodes.
    - WL structural encoding
        - The Weisfeiler-Lehman (WL) algorithm labels nodes according to their structural roles in the graph.
        - This label replaces the position index in the sinusoidal [[Positional Encoding]] and is given as $\left[\sin \left(\frac{\mathrm{WL}\left(v_{j}\right)}{10000^{\frac{2 l}{d_{h}}}}\right), \cos \left(\frac{\mathrm{WL}\left(v_{j}\right)}{10000^{\frac{2 l+1}{d_{h}}}}\right)\right]_{l=0}^{\left\lfloor\frac{d_{h}}{2}\right\rfloor}$
    - These encodings have the same dimension ($d_h$) as the original raw feature embeddings, so they are all added together to obtain the initial node representations of the input subgraph, denoted $\mathbf{H}^{(0)}$.
- Graph Transformer Layer
    - The graph [[Transformers]] layer learns the target node representations $\mathbf{H}^{(l)}=\operatorname{softmax}\left(\frac{\mathbf{Q K}^{\top}}{\sqrt{d_{h}}}\right) \mathbf{V}+\operatorname{G-Res}$
    - G-Res is a graph residual term (skip connection) that helps mitigate the over-smoothing issue of GNNs.
    - Finally, to get the class label, a softmax classifier is applied to the averaged final representations: $\mathbf{z}=\operatorname{softmax}\left(\operatorname{average}\left(\mathbf{H}^{(D)}\right)\right)$
- Implementation
    - Node representation dimension is 300, initialized with GloVe embeddings.
    - 2-layer graph transformer with hidden size 32 and 4 attention heads.
    - Dropout rate 0.5, learning rate 0.001, weight decay, and early stopping on a validation set.

## The results

![[TG-Transformer Results.png]]
![[TG-Transformer training time.png]]

- Outperforms all graph baselines, with much lower memory and computation costs.
- The hyperparameter $k$, which controls the subgraph size during sampling, is important: increasing it improves performance steadily up to an ideal size.
- Ablation study
    - Both structural encodings are important for capturing useful heterogeneous graph structure information.
    - Initialization with pre-trained embeddings is important.
    - Modeling word and document nodes separately is beneficial.
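To make the intimacy-based sampling concrete, here is a minimal dense-numpy sketch under stated assumptions: the function names, the toy adjacency matrix, and the small $k$ are illustrative only; the word-node case (splitting the budget $k$ by the ratios $r_w$ and $r_d$) is omitted; and for a large corpus the matrix inverse would typically be approximated rather than computed exactly.

```python
import numpy as np

def intimacy_matrix(adj, alpha=0.15):
    """PageRank-style intimacy matrix S = alpha * (I - (1 - alpha) * A_norm)^{-1}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))  # D^{-1/2}
    a_norm = d_inv_sqrt @ adj @ d_inv_sqrt                       # normalized symmetric adjacency
    n = adj.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * a_norm)

def document_context(s, doc_node, word_nodes, k):
    """Context subgraph of a document node: its top-k most intimate word nodes."""
    word_nodes = np.asarray(word_nodes)
    scores = s[doc_node, word_nodes]   # intimacy of this document to each word node
    return word_nodes[np.argsort(-scores)[:k]].tolist()

# Toy usage: nodes 0-2 are word nodes, nodes 3-4 are document nodes,
# with already-weighted word-word (PMI) and word-document (TF-IDF) edges.
adj = np.array([[0, 1, 0, 2, 0],
                [1, 0, 1, 1, 3],
                [0, 1, 0, 0, 1],
                [2, 1, 0, 0, 0],
                [0, 3, 1, 0, 0]], dtype=float)
S = intimacy_matrix(adj)
print(document_context(S, doc_node=3, word_nodes=[0, 1, 2], k=2))
```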