# Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Authors: A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes (Facebook AI)
Venue: EMNLP 2017, Outstanding Paper Award
Paper: https://arxiv.org/abs/1705.02364
Code: https://github.com/ihsgnef/InferSent-1
## The problem
- Despite the success of word embeddings, efforts to obtain embeddings for larger chunks of text, such as sentences, have not been nearly as successful.
- Neural nets are very good at capturing the biases of the task on which they are trained, but can easily forget the overall information or semantics of the input data by specializing too much on these biases.
- Arguably the NLP community has not yet found the best supervised task for embedding the semantics of a whole sentence.
## The solution
- Shows that training sentence encoders on the [[Natural Language Inference]] (NLI) task yields the best transferability to downstream NLP tasks.
## The details
- Models: 7 different architectures
- Standard [[LSTM]] and [[GRU]]
	- BiGRU-last, i.e. the concatenation of the last hidden states of the forward and backward GRUs
- BiLSTM with either mean or max pooling
- Inner [[Attention Mechanism]] networks (attention mechanism on top of BiLSTM)
	- Hierarchical [[Convolutional Neural Networks (CNN)]] inspired by AdaSent (Zhao et al., 2015), where the final representation is the concatenation of the max-pooled feature maps of each layer
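The best-performing encoder in the paper is the BiLSTM with max pooling. A minimal NumPy sketch of the pooling step only, assuming the per-token forward/backward hidden states have already been computed (names are illustrative, not from the released code):

```python
import numpy as np

def bilstm_max_encode(fwd_states, bwd_states):
    """Pool per-token hidden states into one fixed-size sentence vector.

    fwd_states, bwd_states: (seq_len, hidden_dim) arrays of forward and
    backward recurrent hidden states for each token (hypothetical inputs;
    the actual LSTM computation is omitted here).
    Returns a (2 * hidden_dim,) sentence embedding: concatenate the two
    directions per token, then take the elementwise max over time.
    """
    h = np.concatenate([fwd_states, bwd_states], axis=1)  # (seq_len, 2*hidden_dim)
    return h.max(axis=0)

# toy example: 3 tokens, hidden_dim = 2
fwd = np.array([[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]])
bwd = np.array([[0.7, 0.0], [0.2, 0.6], [0.1, 0.1]])
emb = bilstm_max_encode(fwd, bwd)
print(emb)  # [0.5 0.9 0.7 0.6]
```

Max pooling over time lets each embedding dimension latch onto the most salient token for that feature, which the paper finds more transferable than using only the last hidden state.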
- Hyperparams for all models
	- SGD: learning rate 0.1 with decay of 0.99. After each epoch, the learning rate is divided by 5 if dev accuracy decreases. Training is stopped when the LR drops below 10^-5. Batch size 64
- Classifier: MLP with 1 hidden layer of 512 units.
	- Word embeddings: [[Count-based Distributional Models|GloVe]] with 300 dimensions
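The learning-rate schedule described above can be sketched as a small helper (function and variable names are illustrative, not taken from the paper's code):

```python
def lr_schedule(lr, dev_acc, prev_dev_acc, decay=0.99, shrink=5.0, min_lr=1e-5):
    """One epoch's learning-rate update.

    Apply the per-epoch decay, then divide by `shrink` if dev accuracy
    dropped relative to the previous epoch. Returns (new_lr, stop), where
    stop is True once the LR falls below `min_lr`.
    """
    lr *= decay
    if prev_dev_acc is not None and dev_acc < prev_dev_acc:
        lr /= shrink
    return lr, lr < min_lr

# toy run over a few epochs of made-up dev accuracies
lr, prev = 0.1, None
for acc in [0.70, 0.72, 0.71, 0.73]:
    lr, stop = lr_schedule(lr, acc, prev)
    prev = acc
    if stop:
        break
```

The aggressive divide-by-5 step acts as a crude early-stopping signal: a couple of dev-accuracy drops are enough to push the LR under the 10^-5 threshold and end training.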
- Evaluation: Sentence embeddings used as features for 12 transfer tasks using [SentEval framework](https://github.com/facebookresearch/SentEval) (which uses a [[Logistic Regression]] fitted with Adam, batch size 64)
- Binary and Multiclass classification
- [[Sentiment Analysis]] (MR and SST datasets)
- Question type (TREC dataset)
- Product reviews (CR dataset)
- Subjectivity/objectivity (SUBJ dataset)
- Opinion polarity (MPQA dataset)
	- Entailment and semantic relatedness - SICK-E and SICK-R datasets; SICK-R is scored by Pearson correlation
	- Semantic Textual Similarity - unsupervised SemEval tasks of the STS14 dataset, scores between 0 and 5
	- Paraphrase detection - MS Paraphrase Corpus (MRPC), binary classification of sentence pairs
- Caption-image retrieval - Rank either a collection of images with given caption or vice versa (COCO dataset)
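For the unsupervised STS-style tasks, evaluation reduces to scoring each sentence pair by cosine similarity of its embeddings and correlating with the gold scores. A minimal sketch with hypothetical helper names (the embeddings stand in for the encoder's output):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_pearson(embs_a, embs_b, gold_scores):
    """Pearson correlation between per-pair cosine similarities and the
    gold similarity scores (0-5 in STS14)."""
    sims = [cosine(a, b) for a, b in zip(embs_a, embs_b)]
    return float(np.corrcoef(sims, gold_scores)[0, 1])

# toy example: identical, orthogonal, and opposite embedding pairs
a = [np.array([1.0, 0.0])] * 3
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
print(sts_pearson(a, b, [5.0, 2.5, 0.0]))  # 1.0 (sims fall linearly with gold)
```

No task-specific training happens here, which is why these tasks probe whether plain cosine distance in the embedding space is semantically meaningful.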
## The results
![[Results InferSent.png]]
- With much less data (570K vs. 64M sentences) but high-quality supervision from the SNLI dataset, the embeddings consistently outperform SkipThought (the previous SOTA).
- The cosine metric on the embedding space is much more semantically informative than in the SkipThought embedding space.
- Models trained with SNLI supervision obtain better results than models trained on other supervised tasks.
- Pre-trained image representation from ResNet combined with embeddings from this work achieve competitive results compared to features learned directly on the image-caption retrieval task.
- Conclusion: the natural language inference task constrains the model to encode the semantic information of the input sentence, and the information required to perform NLI is generally discriminative and informative.