
Papers We Love
#pwlnepal
22 NOVEMBER, 2015

A Morphosyntactic Categorization Scheme
for the Automated Analysis of Nepali
Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi and Yogendra P. Yadava

INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Ashmit Bhattarai
COMPUTER ENGINEERING, KATHMANDU UNIVERSITY
devilcrackstheearth@gmail.com

Topics To Discuss
● Morphology
● Tokenization
● Words to relate

1. INTRODUCTION
Morphosyntactic Tagging
Features of tagsets
➔ Precise and Distinct
➔ Optimal Distinctions

2. TOKENISATION
i. TOKENS
- Appropriately sized units for
morphosyntactic analysis
- Grammatical categories
assigned
ii. ORTHOGRAPHIC WORD
- Set of strings bounded by
whitespace or punctuation
NOTES
- Separate sentences into tokens
- Tokens larger than an orthographic word (multiword units) have not been investigated yet
- Graphical word with multiple elements => Tokenized Separately
- Tokens are separated by space for written language
iii. CLITICS
- A morpheme that has the
syntactic characteristics of a
word
- must be tokenized
- can be postpositions or
affixes
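
The orthographic-word definition above (strings bounded by whitespace or punctuation) can be sketched as a first tokenisation pass. This is a minimal illustration, not the authors' tool: the function name `orthographic_words` and the romanized sample sentence are assumptions made for the example, and clitics are deliberately left attached at this stage.

```python
import re

def orthographic_words(sentence):
    """Split a sentence into orthographic words and punctuation marks.

    A word is a run of word characters; every punctuation character
    becomes its own item. Clitics are NOT separated here.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical romanized sample sentence.
print(orthographic_words("raamle kitaab padhyo."))
# ['raamle', 'kitaab', 'padhyo', '.']
```

Note that "raamle" is still a single item: separating the clitic "le" from the noun is a second step on top of this pass.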

2. TOKENISATION
Postpositions
USES
- Mark oblique cases
- Also written as part of the orthographic word of the noun, adj. or
other word whose case they mark
- Suffixes except “haru” (plural or collective), “ko/kii/kaa”
(genitive), “le” (ergative), “laai” (accusative/dative) are
postpositions

2. TOKENISATION
ISSUE
- Analyse as an inflectional
element of the noun
- Or add separate tokens
- Or treat suffixes one way
and postpositions
another
METHODS
- For a singular ergative noun
(with “le”), use NN1E
- For a plural accusative
noun (with “harulaai”), use NN2A
Layer II Postpositions
PROBLEMS
- Hard to know when to
treat a postposition as a
suffix rather than a clitic
(assign separate tokens to
“ma/bata/sanga”)
- Suffixes can also attach
to nouns, pronouns, adjectives
and adverbs
Conclusion: Abandon NN1E / NN2A

2. TOKENISATION
Postpositions
SOLUTION
- The general category of postposition is tagged as II
- Plural collective marker “haru” is tagged as IH
- Genitive postpositions “ko/kii/kaa” : IKM/IKF/IKO
respectively
- Ergative-instrumental PP “le” : IE
- Accusative/dative PP “laai” : IA
- Possessive Pronouns
“mero” : PMXKM, “tero” : PTNKM, “aafno” : PRFKM
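
The tagging solution above can be sketched as a naive suffix-stripping pass over a romanized orthographic word. The tags (II, IH, IKM/IKF/IKO, IE, IA) come from the slide; the function name `split_postpositions`, the longest-match strategy and the sample words are assumptions for illustration only. Blind suffix matching will over-split any stem that happens to end in one of these strings, so a real tokeniser would need a lexicon.

```python
# Tags are taken from the slide; the stripping strategy is an assumption.
POSTPOSITION_TAGS = {
    "haru": "IH",                              # plural/collective marker
    "ko": "IKM", "kii": "IKF", "kaa": "IKO",   # genitive
    "le": "IE",                                # ergative-instrumental
    "laai": "IA",                              # accusative/dative
    "ma": "II", "bata": "II", "sanga": "II",   # general postpositions
}
_FORMS_LONGEST_FIRST = sorted(POSTPOSITION_TAGS, key=len, reverse=True)

def split_postpositions(word):
    """Peel known postpositions off the end of a word.

    Returns [(stem, None), (postposition, tag), ...] in surface order.
    The stem is left untagged because its category is not decided here.
    """
    stem, clitics = word, []
    while True:
        for form in _FORMS_LONGEST_FIRST:
            # Require a non-empty stem to remain after stripping.
            if stem.endswith(form) and len(stem) > len(form):
                clitics.insert(0, (form, POSTPOSITION_TAGS[form]))
                stem = stem[: -len(form)]
                break
        else:
            # No form matched: nothing more to strip.
            return [(stem, None)] + clitics

print(split_postpositions("ketaaharulaai"))
# [('ketaa', None), ('haru', 'IH'), ('laai', 'IA')]
```

Stacked markers such as "haru" + "laai" come out as separate tokens, which is what replaces the abandoned combined noun tags like NN2A.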

3. GENDER ON NOUNS AND ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix “o”
Feminine => suffix “ii”
Other => suffix “aa”
- By default, other nouns and suffixes are mostly
masculine

ADJECTIVES
ISSUE
- Most of the Adj., nouns,
descriptive determiners
like “bibhinna, sampurna”
are not gender marked
- Feminine nouns ending
in “ii”, like “aaimaaii”,
do not have a corresponding
masculine noun ending
in “o”
- Gender marked form
“yetro” has unmarked
forms “yo/yi/eti”
METHODS
- Ignore gender inflection
altogether
PROBLEMS
- Difficult to extract feminine-
marked adjectives due to false
positives such as “dhani”
ending with “ii”
- Including gender marking in the
tagging system causes
problems for unmarked words
and complicates automated
tagging

ADJECTIVES
SOLUTION
- Assign the tags JM, JF, JO and JX to adjectives with
suffix “o” (masculine), suffix “ii” (feminine),
suffix “aa” (other) and unmarked adjectives
respectively
- Ignore plural, public and honorificity for
simplicity
- Ignore gender marking on nouns
Example: “Sita” as NP and “aaimaaii” as NN
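
The adjective tags above can be sketched as a suffix check on romanized forms. This is only an illustration under stated assumptions: the function name `tag_adjective`, the `unmarked` exception set and the sample adjectives "raamro/raamrii/raamraa" are not from the slides, and "dhanii" is written with "ii" here to match the slide's description of the false positive.

```python
def tag_adjective(word, unmarked=frozenset()):
    """Assign JM/JF/JO/JX to a romanized adjective by its final suffix.

    `unmarked` is a hypothetical exception list for adjectives whose
    ending only looks like a gender suffix.
    """
    if word in unmarked:
        return "JX"
    if word.endswith("ii"):
        return "JF"   # feminine
    if word.endswith("aa"):
        return "JO"   # other
    if word.endswith("o"):
        return "JM"   # masculine
    return "JX"       # unmarked

for adjective in ("raamro", "raamrii", "raamraa", "bibhinna"):
    print(adjective, tag_adjective(adjective))

# Without an exception list, a word that merely ends in "ii" is mis-tagged.
print(tag_adjective("dhanii"))                       # JF (false positive)
print(tag_adjective("dhanii", unmarked={"dhanii"}))  # JX
```

The last two calls show why the slides list false positives as a problem: the suffix alone cannot tell a feminine-marked adjective from an unmarked one.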

4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms
“bhanidiyeko”
- Compound verbs = main verb + vector
verb / light verb
- “garidiyo” = “gari” + “diyo”
- Tense-aspect mood combination
created by use of auxiliary verbs to
form compounded form
“hunu/ hunthyo/ huncha/ bhairahayo
/hunecha”
- Each compound verb can encode
voice, tense, mood, aspect, person,
gender, number, honorificity and vector
verb. This leads to a very large number of
tags.
METHODS
- Possible solutions :
Probabilistic approach
(Markov model)
- Training data needed ~ (tagset size)^2
- Impractical as tagset grows
- Re-Tokenization approach
- Difficult to trace root word, root verb
- Example :
“huncha” => “hun” + “cha”
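
The quadratic growth noted above can be made concrete with a small sketch: a first-order Markov tagger needs evidence for every tag-to-tag transition, so the number of possible transitions is the square of the tagset size. The function names and the toy tag sequence are assumptions for illustration; the tag names are borrowed from earlier slides.

```python
from collections import Counter

def transition_counts(tag_sequence):
    """Count adjacent tag pairs (bigrams) in a tagged sequence."""
    return Counter(zip(tag_sequence, tag_sequence[1:]))

def transition_coverage(tag_sequence, tagset):
    """Return (distinct transitions observed, transitions possible)."""
    return len(transition_counts(tag_sequence)), len(tagset) ** 2

# Toy example with six tags from the earlier slides.
tagset = {"NP", "NN", "JM", "IE", "IH", "IA"}
sequence = ["NP", "IE", "JM", "NN", "IH", "IA"]
print(transition_coverage(sequence, tagset))  # (5, 36)

# The possible-transition count grows quadratically with tagset size.
for size in (50, 100, 500):
    print(size, size ** 2)
```

Each extra inflectional distinction encoded in a tag multiplies the tagset, and the amount of training data needed to observe the transitions grows with its square.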

4. MODELLING NEPALI VERB INFLECTION
Solution
- Simplify the assumptions about Nepali verbs that underlie
the tagset.
- Tag according to the last element of the compound verb
(Person-Number-Gender inflection)
- High honorific verbs are tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are possible combinations of:
