# Document Classification
How do we model long inputs such as entire documents or paragraphs?
## Approaches
1. Treat a document like a very long sequence (first sketch below)
   - concatenate all inputs
   - perform (multi-layer) convolution
   - max-over-time pool over the entire (document) sequence
   - connect the pooled layer to a classification layer
2. Treat a document like a sequence of sentences (second sketch below)
   - for each sentence
     - perform (multi-layer) convolution over word n-grams
     - max-over-time pool over the word positions within the sentence
   - use the resulting sentence representations as input to the next layers
   - perform (multi-layer) convolution over neighboring sentences
   - max-over-time pool over the results
   - connect the final pooled layer to a classification layer
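A minimal PyTorch sketch of approach 1; the class name, layer sizes, and hyperparameters are illustrative assumptions, not taken from the source:

```python
import torch
import torch.nn as nn

class FlatDocCNN(nn.Module):
    """Approach 1: treat the whole document as one long token sequence."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=128, kernel_size=5, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # convolution over word n-grams of the concatenated document
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=kernel_size // 2)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):                    # (batch, doc_len)
        x = self.emb(token_ids).transpose(1, 2)      # (batch, emb_dim, doc_len)
        x = torch.relu(self.conv(x))                 # (batch, n_filters, doc_len)
        x = x.max(dim=2).values                      # max-over-time pooling over the document
        return self.fc(x)                            # logits for the classification layer
```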
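And a corresponding sketch of approach 2, reusing the imports above and again with illustrative names and hyperparameters: a word-level CNN pools each sentence into a vector, and a second CNN convolves over neighboring sentence vectors:

```python
class HierDocCNN(nn.Module):
    """Approach 2: sentence-level CNN followed by a CNN over sentence representations."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=128, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.word_conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)    # over word n-grams
        self.sent_conv = nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1)  # over neighboring sentences
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):                                # (batch, n_sents, sent_len)
        b, s, w = token_ids.shape
        x = self.emb(token_ids.view(b * s, w)).transpose(1, 2)   # (b*s, emb_dim, sent_len)
        x = torch.relu(self.word_conv(x)).max(dim=2).values      # pool within each sentence -> (b*s, n_filters)
        x = x.view(b, s, -1).transpose(1, 2)                     # (batch, n_filters, n_sents)
        x = torch.relu(self.sent_conv(x)).max(dim=2).values      # pool over sentences -> (batch, n_filters)
        return self.fc(x)                                        # logits for the classification layer
```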
## Doc2Vec
Doc2Vec (Le and Mikolov, 2014) learns document/paragraph representations directly.
![[Doc2Vec.png]]
- Extends Word2Vec's CBOW model to documents
- Each paragraph (document) is assigned its own unique (embedding) representation
- Task: predict the next word (trained with SGD)
- The paragraph embedding learns to capture the (wider-context) information that is needed to correctly predict the next word but is not sufficiently captured by the (local) context word embeddings
What about a Skip-gram-style model? The analogue (called PV-DBOW in Le and Mikolov, 2014) drops the context words and uses only the paragraph embedding to predict words sampled from the paragraph.
![[Doc2Vec-Skipgram.png]]
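Both training variants exist in gensim's `Doc2Vec` implementation: `dm=1` gives the CBOW-style PV-DM model and `dm=0` the Skip-gram-style PV-DBOW model. A minimal training sketch with a toy corpus (tags, tokens, and hyperparameters are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: each document is a list of tokens with a unique tag (its paragraph id)
corpus = [
    TaggedDocument(words=["deep", "learning", "for", "text"], tags=["doc0"]),
    TaggedDocument(words=["convolutional", "networks", "for", "images"], tags=["doc1"]),
]

# dm=1 -> PV-DM (context words + paragraph vector predict the target word)
# dm=0 -> PV-DBOW (paragraph vector alone predicts sampled words)
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1, dm=1, epochs=40)

vec = model.dv["doc0"]                 # learned paragraph embedding
print(model.dv.most_similar("doc0"))   # retrieval/clustering by embedding similarity
```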
What can be done with paragraph (document) embeddings?
- cluster documents in a collection based on paragraph embedding similarity
- given a query representation, retrieve related documents based on embedding similarity
- classify new documents? Not directly!
New documents don't have paragraph embeddings yet. For each new document:
- create a new (empty) paragraph embedding
- train word prediction on the new document (using SGD), but keep all parameters fixed except for the new paragraph embedding
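This inference step is what gensim's `infer_vector` does: it initialises a fresh paragraph embedding for the unseen document and runs a few SGD epochs while the word embeddings and output weights stay frozen. Continuing the sketch above:

```python
# infer an embedding for an unseen document: only the new paragraph
# vector is updated; all other model parameters stay fixed
new_doc = ["recurrent", "networks", "for", "text"]
new_vec = model.infer_vector(new_doc, epochs=40)

# the inferred vector can now be compared against the training documents
print(model.dv.most_similar([new_vec]))
```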
Paragraph embeddings do not give us classes directly. To obtain them:
- train a separate classifier that maps paragraph embeddings to classes (neural net, logistic regression, ...)
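A minimal sketch of that classification step using scikit-learn's logistic regression on the paragraph embeddings; the labels and document tags are illustrative and assume the `model` trained above:

```python
from sklearn.linear_model import LogisticRegression

# paragraph embeddings of the labelled training documents as features
X_train = [model.dv[f"doc{i}"] for i in range(2)]
y_train = ["nlp", "vision"]          # illustrative class labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# classify a new document via its inferred paragraph embedding
print(clf.predict([model.infer_vector(["transformers", "for", "text"])]))
```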
---
## References
- Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In *Proceedings of ICML 2014*.