Morphological processing

# Morphological processing

Morphology studies the structure of words. It analyzes words and their parts, such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning.

## Morpheme

The minimal information-carrying unit. There are two types: affixes and stems.

### Affix

A morpheme which only occurs in conjunction with other morphemes.

- suffix: dog+s, truth+ful
- prefix: un+wise (derivational only)
- infix: Arabic stem k_t_b: kataba (he wrote); kotob (books)
  - In English: sang (stem sing): not productive
  - e.g., (maybe) absobloodylutely
- circumfix: not in English; German ge+kauf+t (stem kauf, affix ge-t)

Productivity: whether an affix applies generally, and whether it applies to new words.

- sing, sang, sung | ring, rang, rung
- BUT: ping, pinged, pinged

So this infixation pattern is not productive: sing and ring are irregular.

### Stem

Words are made up of a _stem_ (more than one for compounds) and zero or more affixes, e.g., dog+s, book+shop+s.

Note that slither, slide, slip etc. have somewhat similar meanings, but sl- is not a morpheme: when a word is split into morphemes, the remaining part should have a meaning of its own, and -ither, -ide and -ip do not.

## Inflectional vs derivational morphology

Inflectional

- e.g., plural suffix +s, past participle +ed
- sets slots in some paradigm, e.g., tense, aspect, number, person, gender, case
- inflectional affixes are not combined in English
- generally fully productive (except irregular forms), e.g., texted

Derivational

- e.g., un-, re-, anti-, -ism, -ist etc.
- broad range of semantic possibilities, may change part of speech
- indefinite combinations, e.g., antiantidisestablishmentarianism: anti-anti-dis-establish-ment-arian-ism
- generally semi-productive: e.g., escapee, textee, ?dropee, ?snoree, \*cricketee (\* marks an unacceptable form, ? a doubtful one)
- zero-derivation: e.g. tango, waltz

## Internal structure and ambiguity

Morpheme ambiguity: stems and affixes may be individually ambiguous, e.g.
dog (noun or verb), +s (plural or 3rd-person-singular verb)

Structural ambiguity: e.g., shorts or short -s; unionised could be union -ise -ed or un- ion -ise -ed

Bracketing: un- ion -ise -ed

- \*((un- ion) -ise) -ed
- un- ((ion -ise) -ed)

## Applications of morphological processing

A first approach would be to compile a full-form lexicon. This could be reasonable for languages like English, but for languages with rich morphology, such as Russian, it is impractical and redundant. For example, the forms of run:

![[morphology-russian.jpg]]

### Stemming

- Useful for applications which don't need a full analysis of words or sentences, such as [[Information Retrieval]] or classification

### Lemmatization

- Finding stems and affixes as a precursor to parsing (often inflections only)

### Generation

- Morphological processing may be bidirectional: i.e., parsing and generation.
- party + PLURAL <-> parties, sleep + PAST_VERB <-> slept

## Lexical requirements for morphological processing

- affixes, plus the associated information conveyed by the affix
  - ed PAST_VERB
  - ed PSP_VERB
  - s PLURAL_NOUN
- irregular forms, with associated information similar to that for affixes
  - began PAST_VERB begin
  - begun PSP_VERB begin
- stems with syntactic categories, e.g. to avoid corpus being analysed as corpu -s

## Spelling rules

English morphology is essentially _concatenative_

- irregular morphology: inflectional forms have to be listed
- regular _phonological_ and _spelling changes_ associated with affixation, e.g.
  - -s is pronounced differently with a stem ending in s, x or z
  - spelling reflects this with the addition of an e (boxes etc.)

In English, the description is independent of particular stems/affixes.

### Finite State Transducers

e.g. box^s to boxes

$$
\varepsilon \rightarrow \mathrm{e} \,/\, \left\{\begin{array}{l} \mathrm{s} \\ \mathrm{x} \\ \mathrm{z} \end{array}\right\} \hat{\ } \; \_ \; \mathrm{s}
$$

Here ^ marks the affix boundary and _ the position where the e is inserted.

Can be formally implemented as a finite state transducer.
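
As a rough illustration of what the e-insertion rule does in both directions, here is a minimal Python sketch. It is not a real transducer, only plain functions that mimic the rule's input/output behaviour. The names `generate`, `analyse`, `valid_analyses` and the tiny `STEMS` lexicon are made up for this example and are not part of any library.

```python
SIBILANTS = {"s", "x", "z"}
BOUNDARY = "^"           # affix boundary marker used in the rule
STEMS = {"box", "dog"}   # hypothetical toy stem lexicon


def generate(lexical):
    """Map a lexical form such as 'box^s' to a surface form such as 'boxes'."""
    out = []
    for i, ch in enumerate(lexical):
        if ch != BOUNDARY:
            out.append(ch)
            continue
        prev = lexical[i - 1] if i > 0 else ""
        nxt = lexical[i + 1:i + 2]
        # Insert 'e' only between a stem ending in s/x/z and the suffix 's';
        # in every other context the boundary maps to nothing.
        if prev in SIBILANTS and nxt == "s":
            out.append("e")
    return "".join(out)


def analyse(surface):
    """Propose stem^s analyses for a surface form ending in 's'."""
    candidates = []
    if surface.endswith("s"):
        # Reading 1: nothing was inserted at the boundary.
        candidates.append(surface[:-1] + BOUNDARY + "s")
        # Reading 2: the 'e' before the final 's' was inserted by the rule.
        if surface.endswith("es") and surface[-3:-2] in SIBILANTS:
            candidates.append(surface[:-2] + BOUNDARY + "s")
    return candidates


def valid_analyses(surface):
    """Keep only the candidates whose stem is in the toy lexicon."""
    return [c for c in analyse(surface) if c.split(BOUNDARY)[0] in STEMS]


print(generate("box^s"))        # boxes
print(analyse("boxes"))         # ['boxe^s', 'box^s']
print(valid_analyses("boxes"))  # ['box^s']
```

Analysis proposes `boxe^s` as well as `box^s`, because the rule alone cannot tell whether the e belongs to the stem. Only the lexicon lookup rules out the first reading.
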
The following transducer implements the es rule:

![[es-transducer.jpg]]

These kinds of sequence processing models usually suffer from _overgeneration_:

- It is impossible to build a model that accepts all and only the correct sequences of a given language.
- Typically we try to build a model that accepts all the valid sequences of the language without overgenerating too much.

To deal with overgeneration, we can take the analysis and check whether it consists of a valid stem and affix.

FSTs assume tokenization (word boundaries) and words split into characters. One character pair per transition!

- Analysis: return character list with affix boundaries, so enabling lexical lookup.
- Generation: input comes from stem and affix lexicons.

One FST per spelling rule: either compile them into one big FST or run them in parallel.

FSTs do not allow for internal structure:

- can't model un- ion -ize -d bracketing.

## Implementation of morphological processing

1. Rule-based methods, e.g. the Porter stemmer
   - part of the NLTK toolkit
   - not fully accurate all the time
   - works well for most applications
2. Probabilistic models for morphological segmentation
3. Neural models with character-level input

---

## References

- Morphology Lecture Notes, UvA NLP1 https://cl-illc.github.io/nlp1/resources/slides/Morphology-notes.pdf