# CNNs for NLP
In NLP, the input is 1D instead of 2D as in [[Computer Vision]].
- The patches correspond to n-grams
- The input is not of fixed size
- Input channels correspond to word embedding dimensions instead of RGB
Convolutions with different kernel sizes, i.e., n-gram orders, can be applied in parallel (see the sketch below the figure).
- The resulting feature maps can vary in size, depending on padding and the respective strides
- With max-over-time pooling, we obtain a fixed-size representation that does not depend on the input length
![[CNN for NLP.png]]
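A minimal PyTorch sketch of this idea (filter counts, kernel sizes, and names are illustrative, not from the source): parallel 1D convolutions over the embedded sequence, each followed by max-over-time pooling, give a fixed-size vector regardless of input length.

```python
import torch
import torch.nn as nn

class NGramConvPool(nn.Module):
    """Parallel 1D convolutions over word embeddings (n-gram detectors),
    each followed by max-over-time pooling -> fixed-size representation."""
    def __init__(self, emb_dim=300, n_filters=100, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # One Conv1d per n-gram order; input channels = embedding dimensions
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in kernel_sizes]
        )

    def forward(self, emb):                    # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)                # -> (batch, emb_dim, seq_len)
        feats = []
        for conv in self.convs:
            c = torch.relu(conv(x))            # (batch, n_filters, seq_len - k + 1)
            feats.append(c.max(dim=2).values)  # max over time: (batch, n_filters)
        return torch.cat(feats, dim=1)         # (batch, n_filters * len(kernel_sizes))

# Inputs of different length yield representations of the same size
enc = NGramConvPool()
print(enc(torch.randn(1, 7, 300)).shape)       # torch.Size([1, 300])
print(enc(torch.randn(1, 25, 300)).shape)      # torch.Size([1, 300])
```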
## CNNs for sequence classification
![[CNN for NLP Yoon Kim.png]]
- 2-gram and 3-gram convolutions
- 2 types of input channels: learnable and fixed (pre-trained) embeddings (see the sketch after this list)
- Results: Simple CNN model does very well!
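A sketch of such a two-channel classifier, assuming a pre-trained embedding matrix `pretrained` is available; the layer sizes are illustrative and only the 2-/3-gram kernels mirror the bullets above, so this is not the exact configuration from the figure.

```python
import torch
import torch.nn as nn

class TwoChannelTextCNN(nn.Module):
    """Sentence classifier with two embedding channels:
    one frozen (pre-trained) and one fine-tuned during training."""
    def __init__(self, pretrained, n_classes, n_filters=100, kernel_sizes=(2, 3)):
        super().__init__()
        vocab_size, emb_dim = pretrained.shape
        self.static = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        # 2 input channels (static + tuned); kernels span the full embedding width
        self.convs = nn.ModuleList(
            [nn.Conv2d(2, n_filters, (k, emb_dim)) for k in kernel_sizes]
        )
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        x = torch.stack([self.static(tokens), self.tuned(tokens)], dim=1)
        # x: (batch, 2, seq_len, emb_dim)
        feats = [torch.relu(conv(x)).squeeze(3).max(dim=2).values
                 for conv in self.convs]                 # each: (batch, n_filters)
        return self.out(torch.cat(feats, dim=1))         # (batch, n_classes)
```

Both channels start from the same pre-trained matrix; only one of them is updated, which matches the idea of combining fixed pre-trained and learnable embeddings.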
## CNNs for Morphology
- What about units smaller than words?
- relevant when modeling unseen/rare inflections of words
- relevant for robustness with respect to noisy input with typos
- CNNs are popular for modeling sub-word units, especially at the character level
- Instead of learning a word embedding directly
- split input into tokens
- split each token into characters
- apply convolutions at character level
- for each token: combine by pooling
![[CNNs for morphology.png]]
- Here we assume that word boundaries are given
- Each token is represented by a fixed number of character embeddings
- This results in one representation per word, which can feed into a network of choice (CNNs, RNNs, ...); a minimal sketch follows below
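A minimal character-level sketch of this pipeline, assuming each token is padded to a fixed number of characters (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Builds a word representation from its characters:
    character embeddings -> 1D convolution -> max-over-time pooling per token."""
    def __init__(self, n_chars=128, char_dim=16, n_filters=64, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):                     # (batch, n_words, max_word_len)
        b, w, l = char_ids.shape
        x = self.char_emb(char_ids)                  # (b, w, l, char_dim)
        x = x.view(b * w, l, -1).transpose(1, 2)     # (b*w, char_dim, l)
        c = torch.relu(self.conv(x))                 # (b*w, n_filters, l)
        pooled = c.max(dim=2).values                 # max over characters
        return pooled.view(b, w, -1)                 # one vector per word

# Example: 1 sentence, 4 words, each padded to 6 characters
enc = CharCNNWordEncoder()
char_ids = torch.randint(1, 128, (1, 4, 6))
print(enc(char_ids).shape)                           # torch.Size([1, 4, 64])
```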
## CNNs and Contexts
- CNNs allow us to model local context using neighboring words
- neighborhood limited by kernel size
- I disagree with the other reviewers who say that this camera is not great.
- is max-over-time pooling sufficient to model this dependency?
How about:
- I disagree with most reviews but I agree with the reviewers who say that this camera is not great.
- I agree with most reviews but I disagree with the reviewers who say that this camera is not great.
- Here both agree and disagree are outside of (reasonably sized) kernels
- max-over-time pooling basically summarizes a bag of convolutions, so the two sentences are hard to tell apart
- How to model larger contexts?
- Increase kernel size?
- downside: becomes more sensitive to positional information
- Stack (many) CNNs? (see the worked formula after this list)
- downside: the network topology is fixed, while input sequences vary in length
- Instead of using flat n-grams, use linguistic structure to define contexts
- Tree-structured Convolution (Ma et al. 2015)
- [[Recurrent Neural Networks (RNN)]]
- [[Graph Convolutional Networks (GCN)]]
- generalize CNNs to arbitrary neighborhoods
- Standard CNNs define neighborhoods as $\lfloor k / 2\rfloor$ words to left and right
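For a rough sense of how far stacking gets us (a back-of-the-envelope calculation, not from the source): with stride 1 and kernel size $k$, a single convolution sees $\lfloor k / 2\rfloor$ words to each side, and stacking $L$ such layers grows the receptive field to
$$r_L = 1 + L\,(k - 1)$$
e.g. three layers with $k = 3$ cover $r_3 = 7$ words, which is larger but still a fixed window.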
### Advantages and disadvantages
Advantages of CNNs for sequence classification
- simple architecture performs very well for many sequence classification tasks
- captures local context
- no feature engineering
- can benefit from pre-trained embeddings
Disadvantages of CNNs for sequence classification
- limited receptive fields: only local patterns are captured initially, so many layers are needed to model long-range dependencies
- max-over-time pooling loses fine-grained temporal and positional information that is often crucial for sequence understanding
## Why CNNs are not ideal for Structural Modeling of Language
- To model language properly we need to be able to...
1. Consider unbounded histories
- n-grams, fixed limited kernels cannot achieve this
- stacking convolutions expands the context, but it remains fixed-size
- max-over-time pooling is unbounded, but ...
2. Consider structural/hierarchical properties of language
- n-grams, fixed limited kernels can only model fixed-sized, local structural properties
- max-over-time pooling is flat and order-insensitive (see the small sketch after this list)
3. Input can be enriched by adding syntactic information, e.g., syntactic dependency information
- syntactic parsers are only available for some languages
- not clear which syntactic information is really required for which task (feature engineering)
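A tiny sketch of the order-insensitivity mentioned in point 2 (shapes are illustrative): permuting the time axis of the convolution outputs does not change the max-pooled vector, so pooling carries no information about where a feature fired.

```python
import torch

# feats: convolution outputs over time, (batch, n_filters, time)
feats = torch.randn(1, 100, 9)
perm = torch.randperm(feats.size(2))

pooled = feats.max(dim=2).values                 # max-over-time pooling
pooled_shuffled = feats[:, :, perm].max(dim=2).values

print(torch.equal(pooled, pooled_shuffled))      # True: order is discarded
```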
---
## References