# Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

Authors: A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes (Facebook AI)
Venue: EMNLP 2017, Outstanding Paper Award
Paper: https://arxiv.org/abs/1705.02364
Code: https://github.com/ihsgnef/InferSent-1

## The problem

- Efforts to obtain embeddings for larger chunks of text, such as sentences, have not matched the success of word embeddings.
- Neural nets are very good at capturing the biases of the task on which they are trained, but can easily forget the overall information or semantics of the input data by specializing too much on these biases.
- Arguably, the NLP community has not yet found the best supervised task for embedding the semantics of a whole sentence.

## The solution

- Shows that training on the [[Natural Language Inference]] (NLI) task yields sentence embeddings with the best transferability to other NLP tasks.

## The details

- Models: 7 different architectures
  - Standard [[LSTM]] and [[GRU]]
  - BiGRU-last, i.e. the concatenation of the last hidden states of the forward and backward GRUs
  - BiLSTM with either mean or max pooling
  - Inner [[Attention Mechanism]] networks (attention mechanism on top of a BiLSTM)
  - Hierarchical [[Convolutional Neural Networks (CNN)]], inspired by AdaSent (Zhao et al., 2015), where the final representation is the concatenation of the max-pooled feature maps of each layer
- Hyperparameters for all models
  - SGD: learning rate 0.1 with decay of 0.99. After each epoch, divide the learning rate by 5 if dev accuracy decreases. Training stops when the learning rate goes below 10^-5. Batch size 64.
  - Classifier: MLP with 1 hidden layer of 512 units.
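The learning-rate schedule above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the exact ordering of the per-epoch 0.99 decay relative to the divide-by-5 step are assumptions.

```python
def run_schedule(dev_accuracies, lr=0.1, decay=0.99, shrink=5.0, floor=1e-5):
    """Yield the learning rate used for each epoch of training.

    Sketch of the paper's schedule: start at 0.1, multiply by 0.99
    after each epoch, divide by 5 whenever dev accuracy decreases,
    and stop once the rate falls below 1e-5.
    """
    prev_acc = float("-inf")
    for acc in dev_accuracies:
        if lr < floor:       # stopping criterion: LR dropped below 10^-5
            break
        yield lr
        if acc < prev_acc:   # dev accuracy decreased -> divide LR by 5
            lr /= shrink
        lr *= decay          # per-epoch decay of 0.99
        prev_acc = acc
```

For example, `list(run_schedule([0.6, 0.7, 0.65, 0.66]))` starts at 0.1, decays to 0.099 and 0.09801, then shrinks sharply after the dip in dev accuracy at the third epoch.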
  - Word embeddings: [[Count-based Distributional Models|GloVe]] with 300 dimensions
- Evaluation: sentence embeddings used as features for 12 transfer tasks via the [SentEval framework](https://github.com/facebookresearch/SentEval) (which fits a [[Logistic Regression]] with Adam, batch size 64)
  - Binary and multiclass classification
    - [[Sentiment Analysis]] (MR and SST datasets)
    - Question type (TREC dataset)
    - Product reviews (CR dataset)
    - Subjectivity/objectivity (SUBJ dataset)
    - Opinion polarity (MPQA dataset)
  - Entailment and semantic relatedness
    - SICK-E and SICK-R datasets; reports Pearson correlation
  - Semantic textual similarity
    - Unsupervised SemEval tasks of the STS14 dataset; scores between 0 and 5
  - Paraphrase detection
    - Microsoft Research Paraphrase Corpus (MRPC); binary classification
  - Caption-image retrieval
    - Rank a collection of images given a caption, or vice versa (COCO dataset)

## The results

![[Results InferSent.png]]

- With much less data (570K sentences compared to 64M) but high-quality supervision from the SNLI dataset, the embeddings consistently outperform those of SkipThought (the state of the art before this work).
- Cosine similarity in the embedding space is much more semantically informative than in the SkipThought embedding space.
- Models trained on SNLI obtain better results than models trained on other supervised tasks.
- Pre-trained image representations from ResNet combined with the sentence embeddings from this work achieve results competitive with features learned directly on the image-caption retrieval task.
- Conclusion: the natural language inference task constrains the model to encode the semantic information of the input sentence, and the information required to perform NLI is generally discriminative and informative.
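The two ingredients behind the best-performing model (BiLSTM-max) and the cosine-based STS evaluation can be sketched in a few lines. This is an illustrative, dependency-free sketch, not the authors' code: it assumes some encoder (e.g. a BiLSTM) has already produced one vector per token, and shows only the element-wise max pooling into a sentence embedding plus the cosine comparison.

```python
import math

def max_pool(hidden_states):
    """Element-wise max over a list of equal-length per-token vectors.

    This is the pooling step of BiLSTM-max: each dimension of the
    sentence embedding is the maximum of that dimension over time.
    """
    return [max(dims) for dims in zip(*hidden_states)]

def cosine(u, v):
    """Cosine similarity between two sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy example: three token vectors of dimension 3 -> one sentence vector.
tokens = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0], [0.5, 0.0, 0.0]]
sentence = max_pool(tokens)  # [1.0, 3.0, 0.5]
```

In the unsupervised STS14 evaluation, similarity scores are exactly such cosine values between the embeddings of two sentences, with no task-specific training on top.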