# Multi-Task Learning in NLP

How can [[Multi-task Learning]] improve performance (Caruana, 1993)?

- Data amplification - introducing an auxiliary task means adding data and introducing regularisation.
- Representation bias - an auxiliary task may lead to different local minima, i.e. to different representations in the hypothesis space.
- Attribute selection - an auxiliary task helps the model select the input features that are actually relevant.
- Eavesdropping - a feature that is hard to learn from the main task alone may be easy to learn through an auxiliary task, which the main task can then "eavesdrop" on.

## MTL setup

1. Choose your tasks
2. Design the network architecture
3. Select the data
4. Decide on task prioritization during training

## Network architecture

- Full sharing - the whole model is shared; MTL is achieved on the data side, e.g. GPT-3.
- Hard sharing - a shared encoder with separate output layers per task (see the sketch at the end of this section).
- Hierarchical sharing - different levels of the network handle different tasks, e.g. low-level tasks such as POS at lower layers and high-level tasks such as NLI at higher layers.
- Soft sharing - a dedicated module controls information flow between the tasks' models.

DecaNLP (McCann et al., 2018) - full sharing
- defines multiple tasks as QA
- ![[Pasted image 20210416170839.png]]

Liu et al. (2019) - hard sharing
- uses BERT as the encoder
- ![[Pasted image 20210416170951.png]]

UniT, a multimodal transformer from Facebook AI (Hu and Singh, 2021) - hard sharing
- ![[Pasted image 20210416171056.png]]
- ![[UniT.png]]

LIMIT-BERT (Zhou et al., 2020) - hard sharing
- trains BERT with ELECTRA on 5 tasks and applies _Syntactic/Semantic Phrase Masking_
- ![[Pasted image 20210416171248.png]]

Joint-many model (Hashimoto et al., 2017) - hierarchical sharing
- ![[Pasted image 20210416171555.png]]

Sparse sharing - Sun et al. (2020) train an over-parameterised network with a mask per subtask, on POS tagging, NER and chunking with a CNN-LSTM.
- Can also model hierarchical sharing and hard sharing.
- ![[Pasted image 20210416171755.png]]
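To make hard sharing concrete, here is a minimal sketch of a shared encoder with one classification head per task. The `HardSharingModel` class, the BiLSTM encoder (a stand-in for a pretrained encoder such as BERT) and the task names and label counts are illustrative assumptions, not taken from any of the papers above.

```python
import torch
import torch.nn as nn


class HardSharingModel(nn.Module):
    """Hard parameter sharing: one shared encoder, one output head per task."""

    def __init__(self, vocab_size, hidden_dim, task_num_labels):
        super().__init__()
        # Shared parameters: embedding + BiLSTM encoder (stand-in for e.g. BERT).
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.LSTM(hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Task-specific parameters: a linear classification head per task.
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, token_ids, task):
        # token_ids: (batch, seq_len) -> per-token logits for the requested task.
        hidden, _ = self.encoder(self.embedding(token_ids))
        return self.heads[task](hidden)


# Hypothetical setup: POS tagging, NER and chunking share the encoder but not the heads.
model = HardSharingModel(vocab_size=30_000, hidden_dim=128,
                         task_num_labels={"pos": 17, "ner": 9, "chunk": 23})
logits = model(torch.randint(0, 30_000, (4, 12)), task="ner")  # shape (4, 12, 9)
```

Each batch comes from a single task, so a training step updates the shared encoder plus only that task's head; this is the general pattern behind hard-sharing models such as Liu et al. (2019).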
## Task prioritization

- Random training
    - Uniform Task Selection (Søgaard and Goldberg, 2016).
    - Proportional Task Selection (Sanh et al., 2018) - sample each task in proportion to its dataset size (a sampling sketch appears at the end of this note).
- Periodic task alternations
    - Dong et al. (2015) use periodic task alternations with equal training ratios for every task.
- Curriculum learning (Bengio et al., 2009)
    - start with an easy task and increase difficulty gradually
    - anti-curriculum learning - start with the hardest task; used in DecaNLP
- Consecutive learning - used in the Joint-many model
    - in one epoch, iterate over the datasets in order of complexity;
    - introduce successive regularisation to avoid catastrophic forgetting.

## Task weights

- Human supervision
    - a fixed curriculum set through human supervision, using a per-task weight in the loss function
- Self-paced learning
    - dynamic adjustment of task weights to force tasks to learn at a similar pace, e.g. GradNorm (Chen et al., 2018) or Dynamic Weight Average (Liu et al., 2019); a DWA sketch appears at the end of this note
- Progress-signal-based curriculum
    - RL-inspired, e.g. dynamic task prioritization (Guo et al., 2018)

## What tasks to combine in an MTL setup?

STUDY 1 - "Identifying beneficial task relations for multi-task learning in deep neural networks" by Bingel and Søgaard (2017)

STUDY 2 - "Multi-task learning of pairwise sequence classification tasks over disparate label spaces" by Augenstein et al. (2018)

STUDY 3 - "Estimating the influence of auxiliary tasks for multi-task learning of sequence tagging tasks" by Schröder and Biemann (2020)

## Task relations examples

- "Multitask Learning for Complaint Identification and Sentiment Analysis" (Singh et al., 2021)
- "Multitask Learning of Negation and Speculation using Transformers" (Khandelwal, 2020)
- "The Pragmatics behind Politics: Modelling Metaphor, Framing and Emotion in Political Discourse" (Cabot et al., 2020)
- "Multi-Task Learning for Metaphor Detection with Graph Convolutional Neural Networks and Word Sense Disambiguation" (Le et al., 2020)
- "Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour" (Mathias et al., 2020)
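Sketch for the task prioritization and task weights sections above: proportional task selection combined with Dynamic Weight Average. The function names, the `alpha` exponent and the example numbers are illustrative assumptions; the DWA update follows Liu et al. (2019), where each task's weight is a temperature-softmax over the ratio of its last two epoch-level losses, rescaled so the weights sum to the number of tasks.

```python
import math
import random


def proportional_task_sampler(dataset_sizes, alpha=1.0):
    """Yield task names with probability proportional to |D_task| ** alpha.

    alpha=1.0 gives proportional task selection; alpha=0.0 gives uniform selection.
    """
    tasks = list(dataset_sizes)
    weights = [dataset_sizes[t] ** alpha for t in tasks]
    while True:
        yield random.choices(tasks, weights=weights, k=1)[0]


def dwa_weights(epoch_losses, temperature=2.0):
    """Dynamic Weight Average: tasks whose loss shrank the least get a larger weight.

    epoch_losses maps each task to its list of per-epoch average losses.
    The returned weights sum to the number of tasks.
    """
    tasks = list(epoch_losses)
    if any(len(epoch_losses[t]) < 2 for t in tasks):
        return {t: 1.0 for t in tasks}  # first epochs: equal weights
    # Ratio of the last two epoch losses; values near 1 mean little progress.
    ratios = {t: epoch_losses[t][-1] / epoch_losses[t][-2] for t in tasks}
    # Softmax over the ratios; a larger temperature spreads the weights more evenly.
    exp = {t: math.exp(r / temperature) for t, r in ratios.items()}
    z = sum(exp.values())
    return {t: len(tasks) * exp[t] / z for t in tasks}


# Hypothetical usage: NER improved less than POS, so it receives the larger weight.
weights = dwa_weights({"pos": [0.90, 0.60], "ner": [1.20, 1.15]})
sampler = proportional_task_sampler({"pos": 50_000, "ner": 10_000})
task = next(sampler)  # "pos" roughly 5 times out of 6
```

In a training loop, the sampled task selects the batch (and, in a hard-sharing model, the output head), and the DWA weight rescales that task's loss before backpropagation.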