Papers We Love
#pwlnepal
22 NOVEMBER, 2015
A Morphosyntactic Categorization Scheme
for the Automated Analysis of Nepali
Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi and Yogendra P. Yadava
INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Ashmit Bhattarai
COMPUTER ENGINEERING devilcrackstheearth@gmail.com KATHMANDU UNIVERSITY
Topics to Discuss
- Morphology
- Tokenization
- Words to relate
1. INTRODUCTION
Morphosyntactic Tagging
Features of tagsets
- Precise and Distinct
- Optimal Distinctions
2. TOKENISATION
i. TOKENS
- Appropriately sized units for
morphosyntactic analysis
- Grammatical categories
assigned
ii. ORTHOGRAPHIC WORD
- Set of strings bounded by
whitespace or punctuation
NOTES
- Separate sentences into tokens
- Tokens larger than an orthographic word (multiword units) are not investigated yet
- A graphical word with multiple elements => each element is tokenized separately
- Tokens are separated by space for written language
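The definition of an orthographic word above (a string bounded by whitespace or punctuation) can be sketched as a small Python function. This is a minimal illustration, not the authors' tokenizer: the punctuation set, the inclusion of the Devanagari danda (U+0964), and the transliterated sample sentence are all assumptions.

```python
import re

# Assumed delimiter set: whitespace, common Latin punctuation, and the
# Devanagari danda (U+0964). The slides do not list the exact characters.
ORTHOGRAPHIC_WORD = re.compile(r"[^\s.,;:!?\u0964]+")

def orthographic_words(text):
    """Return the maximal runs of characters between delimiters."""
    return ORTHOGRAPHIC_WORD.findall(text)

# Hypothetical transliterated sentence.
words = orthographic_words("raamle bhaat khaayo.")
```

Each orthographic word found this way may still contain several tokens, which is what the clitic discussion addresses.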
iii. CLITICS
- A morpheme that has the
syntactic characteristics of a
word
- Must be tokenized
- Can be postpositions or
affixes
2. TOKENISATION
- Mark oblique cases
- Also written as part of the orthographic word, attached to the noun, adj. or
other word whose case they mark
- Suffixes except haru (plural or collective), ko/kii/kaa
(genitive), le (ergative) and laai (accusative/dative) are
postpositions
Postpositions
USES
2. TOKENISATION
ISSUE
- Analyse as an inflectional
element of the noun
- Add separate tokens
- Different treatment
for suffixes on the one hand
and postpositions on the other
METHODS
- For a singular ergative noun
with le, use NN1E
- For a plural accusative
noun with harulaai, use NN2A
Layer II Postpositions
PROBLEMS
- Hard to know when to
treat postpositions as
suffixes rather than clitics
(assign tokens to
ma/bata/sanga)
- Suffixes can attach
to nouns, pronouns, adjectives
and adverbs too
Conclusion: Abandon
NN1E / NN2A
2. TOKENISATION
SOLUTION
- Category of postposition is tagged as II
- Plural collective marker haru tagged as IH
- Genitive postpositions ko/kii/kaa : IKM/IKF/IKO
respectively
- Ergative-instrumental PP le : IE
- Accusative/dative PP laai : IA
- Possessive Pronouns
mero : PMXKM, tero: PTNKM, aafno : PRFKM
Postpositions
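The tagging decisions on this slide can be sketched as a suffix-stripping routine. The tags come from the slide; the transliterated spellings, the greedy longest-match strategy, the `min_stem` guard, and the function name `split_clitics` are illustrative assumptions. A real tokenizer would need a lexicon to avoid splitting words that merely end in one of these strings.

```python
# Tags as listed on the slide; ma/bata/sanga fall under the general
# postposition tag II. Spellings are assumed transliterations.
POSTPOSITION_TAGS = {
    "haru": "IH",
    "ko": "IKM",
    "kii": "IKF",
    "kaa": "IKO",
    "le": "IE",
    "laai": "IA",
    "ma": "II",
    "bata": "II",
    "sanga": "II",
}

def split_clitics(word, min_stem=2):
    """Peel known postpositions off the end of a transliterated word.

    Returns [(stem, None), (clitic, tag), ...] in surface order.
    """
    clitics = []
    stem = word
    stripped = True
    while stripped:
        stripped = False
        # Try longer suffixes first so "laai" is preferred over shorter ones.
        for suffix in sorted(POSTPOSITION_TAGS, key=len, reverse=True):
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                clitics.insert(0, (suffix, POSTPOSITION_TAGS[suffix]))
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return [(stem, None)] + clitics

# Hypothetical example: plural accusative noun as three tokens.
tokens = split_clitics("ketaharulaai")
```

With this approach the plural accusative form gets the separate tags IH and IA instead of a fused tag such as NN2A. The stem is left untagged here because noun tagging is a separate step.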
3. GENDER ON NOUNS AND
ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix o
Feminine => suffix ii
Other => suffix aa
- The default gender for other nouns and suffixes is mostly
masculine
3. GENDER ON NOUNS AND
ADJECTIVES
ISSUE
- Most adjectives, nouns and
descriptive determiners
like bibhinna, sampurna
are not gender marked
- Feminine nouns ending
with ii, like aaimaaii,
do not have a corresponding
masculine noun ending
with o
- The gender-marked form
yetro has the unmarked
forms yo/yi/eti
METHODS
- Ignore gender inflection
altogether
PROBLEMS
- Difficult to extract feminine-
marked adjectives due to false
positives such as dhanii
ending with ii
- Including gender marking in
the tagging system causes
problems for unmarked words
and complicates automated
tagging
3. GENDER ON NOUNS AND
ADJECTIVES
- Assign the tags JM, JF, JO and JX to adjectives with
suffix o (masculine), suffix ii (feminine),
suffix aa (other) and unmarked adjectives
respectively
- Ignore plural, public and honorificity for
simplicity
- Ignore gender marking on nouns
Example: Sita as NP and aaimaaii as
NN
SOLUTION
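The adjective rule above can be sketched as a suffix check. This is only an illustration under assumptions: the transliterated word forms, the exception set, and the function name `tag_adjective` are hypothetical, and the exception list stands in for whatever lexical knowledge a real tagger would use to handle false positives such as dhanii.

```python
# Hypothetical exception list: unmarked adjectives that only look
# feminine-marked because they happen to end in "ii".
UNMARKED_EXCEPTIONS = {"dhanii"}

def tag_adjective(word):
    """Assign JM/JF/JO/JX to a transliterated adjective by its ending."""
    if word in UNMARKED_EXCEPTIONS:
        return "JX"
    if word.endswith("ii"):
        return "JF"  # feminine
    if word.endswith("aa"):
        return "JO"  # other
    if word.endswith("o"):
        return "JM"  # masculine
    return "JX"      # unmarked

# Hypothetical example forms.
tags = [tag_adjective(w) for w in ("raamro", "raamrii", "raamraa", "bibhinna")]
```

Nouns are deliberately left out of this function, in line with the decision to ignore gender marking on nouns.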
4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms,
e.g. bhanidiyeko
- Compound verbs = main verb + vector
verb / light verb
- garidiyo = gari + diyo
- Tense-aspect-mood combinations are
created by using auxiliary verbs to
form compound forms:
hunu/ hunthyo/ huncha/ bhairahayo
/hunecha
- Each compound verb can represent
voice, tense, mood, aspect, person,
gender, number, honorificity and vector
verb. This leads to a very large
tagset.
METHODS
- Possible solutions:
Probabilistic approach
(Markov model)
- Training data ~ (tagset size)^2
- Impractical as tagset grows
- Re-Tokenization approach
- Difficult to trace root word, root verb
- Example :
huncha => hun + cha
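The scaling problem with the probabilistic approach can be made concrete with a little arithmetic. A first-order Markov model has one transition per ordered pair of tags, so the number of transitions to estimate is the square of the tagset size. The per-category value counts below are hypothetical placeholders, not figures from the paper.

```python
from math import prod

# Hypothetical number of distinct values per verbal category.
FEATURE_VALUES = {
    "tense": 3,
    "aspect": 3,
    "person": 3,
    "number": 2,
    "gender": 2,
    "honorific": 3,
}

def full_tagset_size(features):
    """Tags needed if every combination of values gets its own tag."""
    return prod(features.values())

def bigram_transitions(tagset_size):
    """Tag-to-tag transitions in a first-order Markov model."""
    return tagset_size ** 2

verb_tags = full_tagset_size(FEATURE_VALUES)
transitions = bigram_transitions(verb_tags)
```

Even with these small assumed inventories, verbs alone would need hundreds of tags and on the order of a hundred thousand transitions, which is why a simplified verb tagset is attractive.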
Solution
- Simplify the assumptions about Nepali verbs that underlie
the tagset.
- Tag according to the last element of the compound verb (Person-Number-Gender inflection)
- High honorific verbs are tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are possible combinations of:
4. MODELLING NEPALI VERB INFLECTION
THANK YOU !!
#pwlnepal
