Papers We Love
#pwlnepal
22 NOVEMBER, 2015
A Morphosyntactic Categorization Scheme
for the Automated Analysis of Nepali
Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi and Yogendra P. Yadava
INTRODUCTION
TO
NATURAL LANGUAGE PROCESSING
Ashmit Bhattarai
COMPUTER ENGINEERING devilcrackstheearth@gmail.com KATHMANDU UNIVERSITY
Topics to Discuss
- Morphology
- Tokenization
- Words to relate
1. INTRODUCTION
Morphosyntactic Tagging
Features of tagsets
- Precise and Distinct
- Optimal Distinctions
2. TOKENISATION
i. TOKENS
- Appropriately sized units for morphosyntactic analysis
- Each token is assigned a grammatical category
ii. ORTHOGRAPHIC WORD
- A string bounded by whitespace or punctuation
NOTES
- Sentences are separated into tokens
- Tokens longer than one orthographic word (multiword units) have not been investigated yet
- A graphical word containing multiple elements is tokenised into separate tokens
- In written language, tokens are separated by spaces
iii. CLITICS
- A morpheme that has the syntactic characteristics of a word
- Must be tokenised separately
- Can be postpositions or affixes
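The definition of an orthographic word above (a string bounded by whitespace or punctuation) can be sketched in a few lines. This is only an illustration, not the tokeniser from the paper: the function name `orthographic_words` and the sample sentence are made up, and the pattern is meant for romanised text only, since Python's `\w` does not match Devanagari vowel signs and would split Devanagari words.

```python
import re

def orthographic_words(sentence: str) -> list[str]:
    """Split romanised text into orthographic words and punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Made-up romanised example; "raamle" is still one orthographic word here,
# even though the clitic "le" will later become a token of its own.
print(orthographic_words("raamle kitaab paDhyo."))
```

Splitting the clitics out of each orthographic word is a separate step after this one.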
2. TOKENISATION
Postpositions
USES
- Mark oblique cases
- Are written as part of the same orthographic word as the noun, adjective or other word whose case they mark
- All such suffixes except haru (plural or collective), ko/kii/kaa (genitive), le (ergative) and laai (accusative/dative) are postpositions
2. TOKENISATION
ISSUE
- Analyse the postposition as an inflectional element of the noun
- Or add it as a separate token
- Or treat the suffixes one way and the other postpositions another way
METHODS
- For a singular ergative noun (with le), use NN1E
- For a plural accusative noun (with harulaai), use NN2A
- Layer II postpositions
PROBLEMS
- Hard to know when to treat a postposition as a suffix rather than a clitic (ma/bata/sanga are assigned separate tokens)
- Suffixes can attach to nouns, pronouns, adjectives and adverbs too
Conclusion: abandon NN1E / NN2A
2. TOKENISATION
Postpositions
SOLUTION
- The general category of postposition is tagged II
- The plural-collective marker haru is tagged IH
- The genitive postpositions ko/kii/kaa are tagged IKM/IKF/IKO respectively
- The ergative-instrumental postposition le is tagged IE
- The accusative/dative postposition laai is tagged IA
- Possessive pronouns: mero : PMXKM, tero : PTNKM, aafno : PRFKM
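The tagging decisions on this slide can be sketched as a small lookup plus suffix stripping. The tags and romanised forms come from the slide; everything else is an assumption for illustration: the function name `split_clitics`, the greedy longest-first stripping, and the example words. A plain string match like this will also split words that merely happen to end in one of these letter sequences, so a real tokeniser would need a lexicon or further rules.

```python
from typing import Optional

# Tags taken from the slide; forms are the romanised spellings used there.
POSTPOSITION_TAGS = {
    "haru": "IH",   # plural-collective marker
    "ko": "IKM",    # genitive, masculine
    "kii": "IKF",   # genitive, feminine
    "kaa": "IKO",   # genitive, other
    "le": "IE",     # ergative-instrumental
    "laai": "IA",   # accusative/dative
}

def split_clitics(word: str) -> list[tuple[str, Optional[str]]]:
    """Peel known postpositions off the end of a romanised word, longest first."""
    clitics: list[tuple[str, Optional[str]]] = []
    stripped = True
    while stripped:
        stripped = False
        for form in sorted(POSTPOSITION_TAGS, key=len, reverse=True):
            # Require a non-empty stem so a bare postposition is left intact.
            if word.endswith(form) and len(word) > len(form):
                clitics.insert(0, (form, POSTPOSITION_TAGS[form]))
                word = word[: -len(form)]
                stripped = True
                break
    # The stem's own tag (e.g. a noun tag) is left to a later tagging step.
    return [(word, None)] + clitics

print(split_clitics("keTaaharulaai"))
print(split_clitics("raamle"))
```

Tagging haru and laai as tokens of their own is what makes combined noun tags such as NN2A unnecessary.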
3. GENDER ON NOUNS AND
ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix o
- Feminine => suffix ii
- Other => suffix aa
- By default, other nouns and suffixes are mostly treated as masculine
3. GENDER ON NOUNS AND ADJECTIVES
ISSUE
- Most adjectives, nouns and descriptive determiners, such as bibhinna and sampurna, are not gender marked
- Feminine nouns ending in ii, such as aaimaaii, do not have a corresponding masculine noun ending in o
- The gender-marked form yetro has unmarked forms yo/yi/eti
METHODS
- Ignore gender inflection altogether
PROBLEMS
- Feminine-marked adjectives are difficult to extract because of false positives such as dhani, which also ends in ii
- Including gender marking in the tagging system causes problems for unmarked words and complicates automated tagging
3. GENDER ON NOUNS AND ADJECTIVES
SOLUTION
- Assign the tags JM, JF, JO and JX to adjectives with suffix o (masculine), suffix ii (feminine), suffix aa (other) and unmarked adjectives respectively
- Ignore plural, public and honorificity for simplicity
- Ignore gender marking on nouns
Example: Sita is tagged NP and aaimaaii is tagged NN
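The four adjective tags can be illustrated with a naive suffix check. This is a sketch under strong assumptions: it presumes the word is already known to be an adjective, it works on romanised text, and the function name `tag_adjective` and the example words are chosen here for illustration. As the previous slide notes, suffix matching alone produces false positives, so this is not a usable tagger on its own.

```python
def tag_adjective(word: str) -> str:
    """Map a romanised adjective to JM/JF/JO/JX by its final suffix."""
    if word.endswith("ii"):
        return "JF"  # feminine suffix ii
    if word.endswith("aa"):
        return "JO"  # "other" suffix aa
    if word.endswith("o"):
        return "JM"  # masculine suffix o
    return "JX"      # no gender marking

for adjective in ["raamro", "raamrii", "raamraa", "bibhinna"]:
    print(adjective, tag_adjective(adjective))
```

Nouns are deliberately left out: under the scheme above they carry no gender tag at all.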
4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms, e.g. bhanidiyeko
- Compound verbs = main verb + vector verb / light verb, e.g. garidiyo = gari + diyo
- Tense-aspect-mood combinations are created by using auxiliary verbs to form compound forms: hunu / hunthyo / huncha / bhairahayo / hunecha
- Each compound verb can encode voice, tense, mood, aspect, person, gender, number, honorificity and vector verb, which leads to a very large tagset
METHODS
- Possible solution: probabilistic approach (Markov model)
  - Training data needed grows as (tagset size)^2
  - Impractical as the tagset grows
- Possible solution: re-tokenisation approach
  - Difficult to trace the root word or root verb
  - Example: huncha => hun + cha
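The squared growth mentioned for the Markov model can be made concrete. A first-order (bigram) model needs one transition probability for every ordered pair of tags, so the parameters to estimate grow with the square of the tagset size. The feature counts below are hypothetical numbers chosen only to show the arithmetic; they are not taken from the paper, and the product is an upper bound because not every combination occurs in the language.

```python
from math import prod

def tagset_size(feature_values: dict[str, int]) -> int:
    """Upper bound on distinct tags if every feature combination gets its own tag."""
    return prod(feature_values.values())

def bigram_transitions(n_tags: int) -> int:
    """Number of tag-to-tag transition probabilities in a first-order Markov model."""
    return n_tags ** 2

# Hypothetical feature inventory, for illustration only.
features = {"tense": 3, "person": 3, "number": 2, "gender": 2, "honorificity": 3}

n_tags = tagset_size(features)
print(n_tags, bigram_transitions(n_tags))  # 108 11664
```

Adding one more three-valued feature triples the tagset but multiplies the number of transitions by nine, which is why the approach becomes impractical as the tagset grows.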
4. MODELLING NEPALI VERB INFLECTION
SOLUTION
- Simplify the assumptions about Nepali verbs that underlie the tagset
- Tag a compound verb according to its last element (the person-number-gender inflection)
- A high honorific verb is tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are the possible combinations of:
THANK YOU !!
#pwlnepal
