This document discusses topics related to developing a morphosyntactic tagging scheme for the automated analysis of Nepali text, including:
1. Issues with tokenization, such as how to handle clitics and postpositions.
2. Proposed solutions for tokenization, including assigning part-of-speech tags to suffixes, clitics, and postpositions.
3. Issues with modeling gender on nouns and adjectives in Nepali.
4. Challenges in modeling verb inflection in Nepali due to complex compound verb forms and proposed solutions such as simplifying assumptions and tagging the last element of compound verbs.
A morphosyntactic categorization scheme for the automated analysis of Nepali
6. 2. TOKENISATION
i. TOKENS
- Appropriately sized units for morphosyntactic analysis
- Grammatical categories are assigned to them
ii. ORTHOGRAPHIC WORD
- Set of strings bounded by
whitespace or punctuation
NOTES
- Sentences are separated into tokens
- Tokens larger than an orthographic word (OW < token) are multiword units; not investigated yet
- A graphical word with multiple elements is tokenized into separate tokens
- In written language, tokens are separated by spaces
iii. CLITICS
- A morpheme that has the syntactic characteristics of a word
- Must be tokenized
- Can be postpositions or affixes
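The definition of an orthographic word above (a string bounded by whitespace or punctuation) can be sketched as a first tokenisation pass. This is only an illustration: the function name `orthographic_words`, the punctuation set, and the romanised sample string are assumptions, not part of the scheme.

```python
import re

# Characters treated as word boundaries. The two danda characters are an
# assumption for Devanagari text; the rest is common ASCII punctuation.
PUNCTUATION = ".,;:!?()\"'\u0964\u0965"

# One orthographic word = a maximal run of characters that are neither
# whitespace nor punctuation.
_WORD = re.compile(r"[^\s" + re.escape(PUNCTUATION) + r"]+")


def orthographic_words(text):
    """Return the orthographic words of `text`, in order."""
    return _WORD.findall(text)


if __name__ == "__main__":
    # Illustrative romanised input, not taken from a corpus.
    print(orthographic_words("gharma, aaimaaiharulaai!"))
```

Each orthographic word returned here may still contain several tokens (for example a noun plus clitics), so a second pass is needed before tags are assigned.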
7. 2. TOKENISATION
Postpositions
USES
- Mark oblique cases
- Also written as part of the orthographic word, attached to the noun, adjective or other word whose case they mark
- All suffixes except haru (plural or collective), ko/kii/kaa (genitive), le (ergative) and laai (accusative/dative) are postpositions
8. 2. TOKENISATION
Layer II Postpositions
ISSUE
- Analyse as an inflectional element of the noun
- Add separate tokens
- Different consideration for suffixes on the one hand and postpositions on the other
METHODS
- For a singular ergative noun with le, use NN1E
- For a plural accusative noun with harulaai, use NN2A
PROBLEMS
- Hard to know when to treat postpositions as clitics rather than suffixes (assign separate tokens to ma/bata/sanga)
- Suffixes can attach to nouns, pronouns, adjectives and adverbs too
Conclusion: Abandon NN1E / NN2A
9. 2. TOKENISATION
Postpositions
SOLUTION
- The category of postposition is tagged II
- The plural/collective marker haru is tagged IH
- Genitive postpositions ko/kii/kaa: IKM/IKF/IKO respectively
- Ergative-instrumental postposition le: IE
- Accusative/dative postposition laai: IA
- Possessive pronouns: mero: PMXKM, tero: PTNKM, aafno: PRFKM
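The tagging decisions above can be sketched as a naive right-edge suffix stripper. This is a minimal illustration under stated assumptions, not the authors' tokeniser: the function name `split_clitics`, the default stem tag `NN`, and the choice to tag ma/bata/sanga with the general II tag are assumptions, and the input is romanised as on the slides.

```python
# Tags as listed on the slide; ma/bata/sanga are assumed to take the
# general postposition tag II.
SUFFIX_TAGS = {
    "haru": "IH",   # plural/collective marker
    "ko": "IKM",    # genitive, masculine
    "kii": "IKF",   # genitive, feminine
    "kaa": "IKO",   # genitive, other
    "le": "IE",     # ergative-instrumental
    "laai": "IA",   # accusative/dative
    "ma": "II",
    "bata": "II",
    "sanga": "II",
}

# Try longer suffixes first so that e.g. "laai" is not shadowed by a
# shorter suffix.
_BY_LENGTH = sorted(SUFFIX_TAGS, key=len, reverse=True)


def split_clitics(word, stem_tag="NN"):
    """Peel known suffixes off the right edge of an orthographic word.

    Returns a list of (token, tag) pairs, stem first. `stem_tag` is a
    placeholder; a real tagger would decide the stem's tag separately.
    """
    peeled = []
    rest = word
    while True:
        for suffix in _BY_LENGTH:
            # Never strip the whole word: a bare suffix is left intact.
            if rest.endswith(suffix) and len(rest) > len(suffix):
                peeled.append((suffix, SUFFIX_TAGS[suffix]))
                rest = rest[: -len(suffix)]
                break
        else:
            break  # no suffix matched, stop peeling
    return [(rest, stem_tag)] + peeled[::-1]


if __name__ == "__main__":
    print(split_clitics("aaimaaiharulaai"))
```

Pure string matching will over-split any stem that happens to end in one of these strings, so a practical tokeniser would need a lexicon or exception list on top of this.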
10. 3. GENDER ON NOUNS AND ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix o
- Feminine => suffix ii
- Other => suffix aa
- For nouns and suffixes not otherwise marked, the default is mostly masculine
11. 3. GENDER ON NOUNS AND ADJECTIVES
ISSUE
- Most adjectives, nouns and descriptive determiners such as bibhinna and sampurna are not gender marked
- Feminine nouns ending in ii, such as aaimaai, do not have a corresponding masculine noun ending in o
- The gender-marked form yetro has the unmarked forms yo/yi/eti
METHODS
- Ignore gender inflection altogether
PROBLEMS
- Difficult to extract feminine-marked adjectives because of false positives such as dhani, which ends in ii
- Including gender marking in the tagging system causes problems for unmarked words and complicates automated tagging
12. 3. GENDER ON NOUNS AND ADJECTIVES
SOLUTION
- Assign the tags JM, JF, JO and JX to adjectives with suffix o (masculine), suffix ii (feminine), suffix aa (other) and to unmarked adjectives respectively
- Ignore plural, public and honorificity for simplicity
- Ignore gender marking on nouns
Example: Sita is tagged NP and aaimaaii is tagged NN
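The adjective tags above amount to a lookup on the final suffix. The sketch below assumes romanised input in the slides' convention (ii, aa for long vowels); the function name `adjective_tag` and the `unmarked` exception parameter are hypothetical, and raamro/raamrii/raamraa are illustrative forms not taken from the slides.

```python
def adjective_tag(adjective, unmarked=frozenset()):
    """Return JM, JF, JO or JX for a romanised adjective.

    `unmarked` is a hypothetical exception list for words whose final
    vowel only looks like a gender suffix (the false-positive problem
    noted on the previous slide).
    """
    if adjective in unmarked:
        return "JX"
    if adjective.endswith("ii"):
        return "JF"  # feminine
    if adjective.endswith("aa"):
        return "JO"  # other
    if adjective.endswith("o"):
        return "JM"  # masculine
    return "JX"      # no gender suffix


if __name__ == "__main__":
    for word in ("raamro", "raamrii", "raamraa", "bibhinna"):
        print(word, adjective_tag(word))
```

Because the check is purely on the surface suffix, the quality of the result depends on how complete the exception list is.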
13. 4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms, e.g. bhanidiyeko
- Compound verbs = main verb + vector verb / light verb
- garidiyo = gari + diyo
- Tense-aspect-mood combinations are created by using auxiliary verbs to form compound forms: hunu / hunthyo / huncha / bhairahayo / hunecha
- Each compound verb can represent voice, tense, mood, aspect, person, gender, number, honorificity and vector verb. This leads to a very large tagset.
METHODS
- Possible solution: probabilistic approach (Markov model)
- Training data needed grows as (tagset size)^2
- Impractical as the tagset grows
- Possible solution: re-tokenization approach
- Difficult to trace the root word / root verb
- Example: huncha => hun + cha
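To see why the probabilistic approach becomes impractical, the arithmetic can be sketched as follows. It assumes a first-order (bigram) Markov tagger, which needs one transition estimate per ordered pair of tags. The per-feature value counts are placeholders chosen only for illustration, not figures from the scheme.

```python
from math import prod

# Hypothetical number of values per verbal feature (placeholders only).
FEATURE_VALUES = {
    "voice": 2,
    "tense": 3,
    "mood": 3,
    "aspect": 3,
    "person": 3,
    "gender": 3,
    "number": 2,
    "honorificity": 3,
}


def tagset_size(feature_values):
    """Upper bound on distinct tags if every feature combination gets a tag."""
    return prod(feature_values.values())


def bigram_transitions(n_tags):
    """Number of tag-to-tag transitions a bigram Markov model must estimate."""
    return n_tags ** 2


if __name__ == "__main__":
    n = tagset_size(FEATURE_VALUES)
    print(n, bigram_transitions(n))
```

Even with small per-feature counts the product grows quickly, and squaring it gives the number of transitions that the training data would have to cover.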
14. 4. MODELLING NEPALI VERB INFLECTION
SOLUTION
- Make simplifying assumptions in the description of Nepali verbs that underlies the tagset.
- Tag according to the last element of the compound verb (person-number-gender inflection)
- High honorific verbs are tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are possible combinations of: