# Multi-Task Learning in NLP
How can [[Multi-task Learning]] improve performance (Caruana, 1993)?
- Data amplification
- The auxiliary task's training signal effectively adds data and acts as a regulariser.
- Representation bias
- An auxiliary task may steer optimisation towards different local minima, i.e. towards different representations in the hypothesis space.
- Attribute selection
- Extra training signal helps the model identify which input features are actually relevant.
- Eavesdropping
- A feature that is hard to learn from one task's signal can be picked up through an auxiliary task for which it is easier to learn.
## MTL setup
1. Choose your tasks
2. Design the network architecture
3. Select the data
4. Task prioritization during training
## Network architecture
- Full sharing
- The whole model is shared; MTL happens on the data side (e.g. GPT-3).
- Hard sharing
- Shared encoder, separate output layers per task (see the sketch after this list).
- Hierarchical sharing
- Tasks are supervised at different depths, e.g. a low-level task like POS at lower layers and a high-level task like NLI at higher layers.
- Soft sharing
- Each task keeps its own parameters; some module controls the information flow between them.
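A rough PyTorch sketch of the hard-sharing pattern; the class name, task names and label counts are invented for illustration and not taken from any of the papers below.

```python
import torch
import torch.nn as nn

class HardSharingTagger(nn.Module):
    """Hard parameter sharing: one shared encoder, one output head per task."""
    def __init__(self, vocab_size, hidden_dim, task_output_sizes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # shared encoder
        # Separate output layers, e.g. {"pos": 17, "ner": 9}
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n_labels)
            for task, n_labels in task_output_sizes.items()
        })

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # per-token logits for the requested task

model = HardSharingTagger(vocab_size=10_000, hidden_dim=128,
                          task_output_sizes={"pos": 17, "ner": 9})
pos_logits = model(torch.randint(0, 10_000, (2, 12)), task="pos")
```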
DecaNLP (McCann et al., 2018)
- full sharing
- frames every task as question answering: each example becomes a (question, context, answer) triple, as in the toy illustration below
- ![[Pasted image 20210416170839.png]]
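A toy illustration of that framing in plain Python (the example strings are made up):

```python
# Every task instance is reduced to a (question, context, answer) triple,
# so one seq2seq question-answering model can cover all tasks.
examples = [
    {"question": "What is the sentiment?",
     "context": "The movie was a delight from start to finish.",
     "answer": "positive"},
    {"question": "What is the translation from English to German?",
     "context": "Hello, how are you?",
     "answer": "Hallo, wie geht es dir?"},
    {"question": "What is the summary?",
     "context": "A long news article ...",
     "answer": "A short summary ..."},
]
```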
Liu et al. (2019)
- hard sharing
- uses BERT as encoder
- ![[Pasted image 20210416170951.png]]
UniT, a multimodal transformer from Facebook AI (Hu and Singh, 2021)
- hard sharing
- ![[Pasted image 20210416171056.png]]
- ![[UniT.png]]
LIMIT-BERT (Zhou et al., 2020)
- hard sharing
- trains BERT with ELECTRA on 5 tasks and applies _Syntactic/Semantic Phrase Masking_
- ![[Pasted image 20210416171248.png]]
Joint-many model of Hashimoto et al. (2017)
- Hierarchical sharing
- ![[Pasted image 20210416171555.png]]
Sparse sharing
- Sun et al. (2020) train an over-parameterised network with a binary mask per task, so each task uses its own subnetwork; evaluated on POS tagging, NER and chunking with a CNN-LSTM (see the sketch below the figure).
- The same mechanism can also express hard sharing and hierarchical sharing.
- ![[Pasted image 20210416171755.png]]
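A very rough sketch of the per-task masking idea, with random fixed masks as placeholders (in the paper the per-task subnetworks are found by pruning the shared network, not drawn at random):

```python
import torch
import torch.nn as nn

class SparselySharedLinear(nn.Module):
    """One over-parameterised weight matrix; each task only uses its masked subnetwork."""
    def __init__(self, in_dim, out_dim, tasks):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        # Fixed binary masks; random here purely for illustration.
        self.masks = {t: (torch.rand(out_dim, in_dim) > 0.5).float() for t in tasks}

    def forward(self, x, task):
        return x @ (self.weight * self.masks[task]).T  # only the unmasked weights contribute

layer = SparselySharedLinear(64, 32, tasks=["pos", "ner", "chunking"])
out = layer(torch.randn(4, 64), task="ner")
```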
## Task prioritization
- Random training
- Uniform Task Selection (Søgaard and Goldberg, 2016).
- Proportional Task Selection (Sanh et al., 2018); both options are sketched after this list.
- Periodic task alternations
- Dong et al. (2015) use periodic task alternations with equal training ratios for every task.
- Curriculum learning (Bengio et al., 2009)
- start with easy tasks and increase the difficulty gradually
- Anti-curriculum learning - start with the hard tasks; used in DecaNLP
- Consecutive learning - used in Joint-many model
- In one epoch, iterate over the datasets in order of complexity;
- Introduce successive regularisation to avoid catastrophic forgetting.
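A small sketch of the two sampling strategies above; the task names and dataset sizes are invented.

```python
import random

dataset_sizes = {"pos": 50_000, "ner": 15_000, "chunking": 8_000}
tasks = list(dataset_sizes)

def pick_task_uniform():
    # Uniform Task Selection: every task is equally likely for the next batch.
    return random.choice(tasks)

def pick_task_proportional():
    # Proportional Task Selection: tasks with larger datasets are sampled more often.
    weights = [dataset_sizes[t] for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# e.g. which task the next ten batches are drawn from
schedule = [pick_task_proportional() for _ in range(10)]
```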
## Task weights
- Human supervision
- A fixed curriculum set by human supervision, using per-task weights in the loss function
- Self-paced learning
- Dynamic adjustment of task weights so that all tasks learn at a similar pace, e.g. GradNorm (Chen et al., 2018) or Dynamic Weight Average (Liu et al., 2019); a DWA sketch follows this list
- Progress-signal based curriculum
- RL-inspired, e.g. dynamic task prioritization (Guo et al., 2018)
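For concreteness, a sketch of Dynamic Weight Average-style weighting (the loss values and temperature below are placeholders): tasks whose loss has been shrinking more slowly get a larger weight in the combined loss.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average: weight each task by how slowly its loss is decreasing."""
    tasks = list(prev_losses)
    # Relative descent rate r_k = L_k(t-1) / L_k(t-2); closer to 1 means slower progress.
    rates = {t: prev_losses[t] / prev_prev_losses[t] for t in tasks}
    exp_rates = {t: math.exp(rates[t] / temperature) for t in tasks}
    norm = sum(exp_rates.values())
    # Softmax-normalised weights that sum to the number of tasks.
    return {t: len(tasks) * exp_rates[t] / norm for t in tasks}

weights = dwa_weights(prev_losses={"pos": 0.40, "ner": 0.90},
                      prev_prev_losses={"pos": 0.80, "ner": 1.00})
# total_loss = sum(weights[t] * current_task_losses[t] for t in tasks)
```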
## What tasks to combine in MTL setup?
STUDY 1
- "Identifying beneficial task relations for multi-task learning in deep neural networks" by Bingel and Søgaard (2017)
STUDY 2
- "Multi-task learning of pairwise sequence classification tasks over disparate label spaces" by Augenstein etal. (2018)
STUDY 3
- "Estimating the influence of auxiliary tasks for multi-task learning of sequence tagging tasks." by Schröder and Biemann (2020)
## Task relations examples
- "Multitask Learning for Complaint Identification and Sentiment Analysis" (Singh et al., 2021)
- "Multitask Learning of Negation and Speculation using Transformers" (Khandelwal, 2020)
- "The Pragmatics behind Politics: Modelling Metaphor, Framing and Emotion in Political Discourse" (Cabot etal., 2020)
- "Multi-Task Learning for Metaphor Detection with Graph Convolutional Neural Networks and Word Sense Disambiguation" (Le etal., 2020 )
- "Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour" (Mathias etal., 2020)