# CNNs for NLP

In NLP, the input is 1D instead of 2D as in [[Computer Vision]]:
- The patches correspond to n-grams
- The input is not of fixed size
- Input channels correspond to word embedding dimensions instead of RGB channels

Convolutions with different kernel sizes, i.e. n-gram orders, can be applied in parallel:
- The resulting features can vary in size, depending on padding and the respective strides
- By max-over-time pooling, we obtain a fixed-size representation that does not depend on the input length

![[CNN for NLP.png]]

## CNNs for sequence classification

![[CNN for NLP Yoon Kim.png]]

- 2-gram and 3-gram convolutions
- 2 types of input channels: learnable and fixed (pre-trained) embeddings
- Results: the simple CNN model does very well! (see the code sketch after the advantages/disadvantages below)

## CNNs for Morphology

- What about units smaller than words?
    - relevant when modeling unseen/rare inflections of words
    - relevant for robustness with respect to noisy input with typos
- CNNs are popular for modeling sub-word units, especially at the character level
- Instead of learning a word embedding directly:
    - split the input into tokens
    - split each token into characters
    - apply convolutions at the character level
    - for each token: combine by pooling

![[CNNs for morphology.png]]

- Here we assume that word boundaries are given
- Each token is represented by a fixed number of character embeddings
- This results in one representation per word, which can be fed into a further network of choice (CNNs, RNNs, ...)

## CNNs and Contexts

- CNNs allow us to model local context using neighboring words
    - the neighborhood is limited by the kernel size
- "I disagree with the other reviewers who say that this camera is not great."
    - Is max-over-time pooling sufficient to model this dependency?
- How about:
    - "I disagree with most reviews but I agree with the reviewers who say that this camera is not great."
    - "I agree with most reviews but I disagree with the reviewers who say that this camera is not great."
    - Here both "agree" and "disagree" are outside of (reasonably sized) kernels
    - Max-over-time pooling basically summarizes bags of convolutions
- How to model larger contexts?
    - Increase the kernel size?
        - downside: becomes more sensitive to positional information
    - Stack (many) CNNs?
        - downside: input sequences have different lengths, but the network topology is fixed
    - Instead of using flat n-grams, use linguistic structure to define contexts:
        - Tree-structured Convolution (Ma et al. 2015)
        - [[Recurrent Neural Networks (RNN)]]
        - [[Graph Convolutional Networks (GCN)]]
            - generalize CNNs to arbitrary neighborhoods
            - standard CNNs define neighborhoods as the $\lfloor k / 2\rfloor$ words to the left and right

### Advantages and disadvantages

Advantages of CNNs for sequence classification:
- a simple architecture performs very well for many sequence classification tasks
- captures local context
- no feature engineering
- can benefit from pre-trained embeddings

Disadvantages of CNNs for sequence classification:
- limited receptive fields capture only local patterns at first; many layers are needed to model long-range dependencies
- max-pooling loses fine-grained temporal information and positional details that are often crucial for sequence understanding
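To make the pipeline concrete, here is a minimal sketch of a Kim-style CNN classifier in PyTorch: parallel 1D convolutions over word embeddings (one per n-gram order) followed by max-over-time pooling and a linear classifier. The names and hyperparameters (`TextCNN`, `emb_dim=100`, `kernel_sizes=(2, 3)`, `num_filters=64`) are illustrative assumptions, not values from the figures, and only a single learnable embedding channel is shown (the fixed pre-trained channel would be a second, frozen `nn.Embedding`).

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Minimal Kim-style CNN for sequence classification (illustrative sketch).

    Parallel 1D convolutions with different kernel sizes (= n-gram orders)
    run over word embeddings; max-over-time pooling turns the variable-length
    feature maps into a fixed-size representation.
    """

    def __init__(self, vocab_size, num_classes, emb_dim=100,
                 kernel_sizes=(2, 3), num_filters=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # One Conv1d per n-gram order; in_channels = embedding dimension.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                          # (batch, emb_dim, seq_len)
        pooled = []
        for conv in self.convs:
            feats = torch.relu(conv(x))                # (batch, num_filters, seq_len - k + 1)
            # Max-over-time pooling: fixed-size output regardless of seq_len.
            pooled.append(feats.max(dim=2).values)     # (batch, num_filters)
        return self.classifier(torch.cat(pooled, dim=1))


# Toy usage: any sequence length >= the largest kernel size yields the same output shape.
model = TextCNN(vocab_size=1000, num_classes=2)
batch = torch.randint(1, 1000, (2, 12))                # 2 sequences of 12 token ids
print(model(batch).shape)                              # torch.Size([2, 2])
```

Because each feature map is pooled over its full (variable) length, the concatenated representation has the same size for any input length, which is what makes the final linear layer possible.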
## Why CNNs are not ideal for Structural Modeling of Language

- To model language properly we need to be able to...
    1. Consider unbounded histories
        - n-grams and fixed, limited kernels cannot achieve this
        - stacking convolutions expands the fixed-size context, but it remains fixed
        - max-over-time pooling is unbounded, but ...
    2. Consider structural/hierarchical properties of language
        - n-grams and fixed, limited kernels can only model fixed-size, local structural properties
        - max-over-time pooling is flat and order-insensitive (see the snippet below)
    3. Input can be enriched by adding syntactic information, e.g. syntactic dependency information
        - syntactic parsers are only available for some languages
        - it is not clear which syntactic information is really required for which task (feature engineering)
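The order-insensitivity in point 2 is easy to check numerically. The snippet below is a small sketch with random, untrained weights and made-up dimensions: it applies a single 3-gram convolution and shows that max-over-time pooling yields the same result no matter how the window responses are ordered, i.e. the pooled vector only records which patterns fired, not where.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single 3-gram convolution over random "word embeddings"
# (dimensions and weights are arbitrary, for illustration only).
conv = nn.Conv1d(in_channels=8, out_channels=4, kernel_size=3)
x = torch.randn(1, 8, 10)                    # (batch, emb_dim, seq_len)

feats = torch.relu(conv(x))                  # (1, 4, 8): one response per 3-gram window
pooled = feats.max(dim=2).values             # max-over-time pooling

# Shuffling the order of the window responses does not change the pooled vector:
perm = torch.randperm(feats.size(2))
pooled_shuffled = feats[:, :, perm].max(dim=2).values

print(torch.equal(pooled, pooled_shuffled))  # True: positional information is discarded
```

The pooled summary is effectively a bag of detected n-grams, which is why swapping "agree" and "disagree" in the example sentences above can go unnoticed once the two words no longer fall inside a single kernel window.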