This document describes the development of a finite state morphological transducer for Modern Standard Arabic called AraComLex. Key points:
1) AraComLex is open source, distributed under GPLv3, and contains over 30,000 lemmas derived from corpus analysis rather than classical Arabic.
2) It is compatible with Foma and can be tested online or built locally using provided instructions.
3) The transducer was built from a seed lexicon that was expanded through automatic lexical enrichment, with machine learning used to predict the features of new lemmas.
Attia sfcm presentation
1. A Lexical Database for Modern Standard
Arabic Interoperable with a Finite State
Morphological Transducer
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia
Tounsi, Josef van Genabith
National Centre for Language Technology (NCLT),
School of Computing, Dublin City University
Funded by:
Enterprise Ireland, the Irish Research Council for Science
Engineering and Technology (IRCSET), and
the EU projects PANACEA and META-NET
2. Contribution
We develop a finite state morphological
transducer for Modern Standard Arabic
1. Open source, distributed under the GPLv3 license
2. Large scale, more than 30,000 lemmas
3. Corpus based, truly representative of Modern
Standard Arabic and not Classical Arabic.
4. Compatible with Foma, an open-source FST compiler
3. Short Tutorial
(1) Download Foma
http://foma.sourceforge.net
(2) Download AraComLex
http://aracomlex.sourceforge.net
(3) Build the transducer by following the README
4. The transducer online
You can test the transducer online:
http://www.cngl.ie/aracomlex
5. Introduction
Modern Standard Arabic vs. Classical
Arabic
Current State of Arabic Lexicography
Lexicons are not corpus-based
Buckwalter Arabic Morphological Analyser
Importance of Lexical Resources
7. Aim
Constructing a lexical database for Modern
Standard Arabic
Building a finite-state morphological
transducer
8. Methodology
Using a medium-scale manually created lexicon
of 10,799 lemmas, with detailed info for:
Nouns (human/nonhuman, POS, Continuation Classes)
Verbs (transitive/intransitive, allow passive, allow
imperative)
Using statistics from a 1 billion word corpus
90% from the LDC's Arabic Gigaword
10% collected from the Al-Jazeera website
Using a pre-annotation tool: MADA+TOKAN
9. Methodology
Using Finite State Technology (XFST)
Bidirectional: suitable for analysis and generation
Handles concatenative and non-concatenative
morphotactics
Speed and efficiency in dealing with millions of
paths
Handles separated dependencies.
Handles phonological and orthographic changes
through alteration rules.
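To illustrate what bidirectional means here, the following is a minimal sketch that stores a relation as plain (lexical, surface) pairs and queries it in both directions. It is not how AraComLex or Foma is implemented; the tag names (`+Noun`, `+Sg`, `+Pl`) and the transliterations are hypothetical and chosen only for the example.

```python
# Toy relation between lexical strings (lemma + tags) and surface strings.
# Tags and transliterations are hypothetical, for illustration only.
PAIRS = [
    ("kitAb+Noun+Sg", "kitAb"),
    ("kitAb+Noun+Pl", "kutub"),
]


def analyze(surface):
    """Analysis direction: surface form -> all matching lexical strings."""
    return [lex for lex, surf in PAIRS if surf == surface]


def generate(lexical):
    """Generation direction: lexical string -> all matching surface forms."""
    return [surf for lex, surf in PAIRS if lex == lexical]


print(analyze("kutub"))
print(generate("kitAb+Noun+Pl"))
```

A real transducer encodes the same relation as a compact network rather than a list, which is what makes lookups over millions of paths practical.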
13. Methodology
Alteration Rules:
Alteration Rules are used for handling discrepancies
between surface forms and underlying representation or
lemmas. We have 130 replace rules.
a -> b || L _ R
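The rule template above reads "rewrite a as b when a is preceded by L and followed by R". As a rough sketch of that idea, the Python function below applies one such rule to a string using regular-expression lookarounds. The function name and parameters are hypothetical, contexts are limited to literal strings, and the behaviour is only an approximation of how a finite state compiler interprets replace rules.

```python
import re


def apply_replace_rule(text, a, b, left="", right=""):
    """Rewrite literal `a` as `b` where it follows `left` and precedes `right`.

    Hypothetical helper: contexts are literal strings, not regular languages.
    """
    pattern = ""
    if left:
        # Lookbehind checks the left context without consuming it.
        pattern += "(?<=" + re.escape(left) + ")"
    pattern += re.escape(a)
    if right:
        # Lookahead checks the right context without consuming it.
        pattern += "(?=" + re.escape(right) + ")"
    # A function replacement keeps `b` literal (no backslash interpretation).
    return re.sub(pattern, lambda match: b, text)


print(apply_replace_rule("xay xaz", "a", "b", left="x", right="y"))
```

Only the first "a" is rewritten, because only that occurrence has both the required left and right context.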
14. Results to Date
Starting off with a seed lexicon
Four Lexical Databases, manually constructed
5,925 nominal lemmas
1,529 verb lemmas
490 patterns (456 for nominals and 34 for verbs)
lemma-root look up database
15. Results to Date
Automatically Extending the Lexical
Database: Lexical Enrichment
Data-driven filtering technique
40,648 lemmas (in Buckwalter or SAMA 3.1)
Statistics from three web search engines
Statistics from the corpus annotated by MADA
29,627 lemmas (left after filtering)
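The slide does not say how the search-engine and corpus statistics are combined, so the following is only a sketch of one possible data-driven filter. The decision rule (keep a lemma if it occurs in the annotated corpus, or if enough search engines return hits for it), the thresholds, and all names are assumptions made for illustration.

```python
def filter_lemmas(candidates, corpus_counts, web_hits,
                  min_corpus=1, min_engines=2):
    """Keep lemmas that show evidence of current use.

    Assumed rule (not from the source): a lemma survives if its corpus
    count reaches `min_corpus`, or if at least `min_engines` search
    engines report a non-zero hit count for it.
    """
    kept = []
    for lemma in candidates:
        in_corpus = corpus_counts.get(lemma, 0) >= min_corpus
        engines_with_hits = sum(1 for h in web_hits.get(lemma, []) if h > 0)
        if in_corpus or engines_with_hits >= min_engines:
            kept.append(lemma)
    return kept


# Hypothetical statistics for three candidate lemmas.
candidates = ["lemma_a", "lemma_b", "lemma_c"]
corpus_counts = {"lemma_a": 5}
web_hits = {"lemma_b": [10, 0, 3], "lemma_c": [0, 0, 1]}
print(filter_lemmas(candidates, corpus_counts, web_hits))
```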
16. Results to Date
Automatically Extending the Lexical
Database: Feature Enrichment
Machine Learning
Multilayer Perceptron classification algorithm
Training Data: 4,816 nominals and 1,448 verbs
Classes for nominals: continuation classes (or inflection
paths), the semantico-grammatical feature of humanness,
and POS (noun or adjective)
Classes for verbs: transitivity, allowing the passive voice,
and allowing the imperative mood
We feed these datasets with frequency statistics from the
corpus and build a vector grid.
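One way to picture the vector grid is as one row per lemma and one column per corpus statistic. The sketch below turns raw per-lemma counts into relative-frequency vectors over a fixed feature order. The feature names are hypothetical and the normalisation is an assumption; the resulting rows would then be passed to a classifier such as the Multilayer Perceptron named above.

```python
def build_vectors(lemma_counts, features):
    """Map each lemma to a relative-frequency vector over `features`.

    `lemma_counts` maps lemma -> {feature name: corpus count}.
    Missing features count as zero; a lemma with no counts gets all zeros.
    """
    vectors = {}
    for lemma, counts in lemma_counts.items():
        total = sum(counts.get(f, 0) for f in features)
        vectors[lemma] = [
            counts.get(f, 0) / total if total else 0.0 for f in features
        ]
    return vectors


# Hypothetical feature columns and counts for one nominal lemma.
features = ["singular", "dual", "plural"]
lemma_counts = {"lemma_x": {"singular": 3, "plural": 1}}
print(build_vectors(lemma_counts, features))
```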
17. Results to Date
Extending the Lexical Database
Feature enrichment using Machine Learning
18. Results to Date
Extending the Lexical Database
With Machine Learning we add:
About 18,000 new lemmas:
12,974 nominals
5,034 verbs
20. Results to Date
FST Morphology Coverage and RPW
Results
A test corpus of 800,000 words, divided as
400,000 for semi-literary texts
400,000 for general news texts
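The two measures can be sketched as follows, assuming that coverage is the share of test words receiving at least one analysis and that the rate per word (RPW) is the average number of analyses over the words that were analysed. Both definitions are assumptions, since the slide does not spell them out, and the `analyze` callback stands in for whichever transducer is being evaluated.

```python
def coverage_and_rate(words, analyze):
    """Return (coverage, analyses per analysed word) for a word list.

    `analyze` is any function mapping a word to a list of analyses.
    """
    analysed = 0
    total_analyses = 0
    for word in words:
        n = len(analyze(word))
        if n:
            analysed += 1
            total_analyses += n
    coverage = analysed / len(words) if words else 0.0
    rate = total_analyses / analysed if analysed else 0.0
    return coverage, rate


# Hypothetical stand-in analyser backed by a tiny dictionary.
toy_lexicon = {"w1": ["reading1", "reading2"], "w2": ["reading3"]}
coverage, rate = coverage_and_rate(
    ["w1", "w2", "w3", "w1"], lambda w: toy_lexicon.get(w, [])
)
print(coverage, rate)
```

Running the same function over the same test corpus with two different analysers gives directly comparable figures.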
21. Future Work
Going beyond SAMA
Including Named Entities and MWEs
Building a spell checker
22. Conclusion
Open-source finite state transducer for Modern
Standard Arabic (AraComLex) distributed under
the GPLv3 license.
We successfully use machine learning to predict
morpho-syntactic features for newly acquired
words.
Comparing our morphological transducer to
SAMA, we find that we achieve comparable
coverage and a lower rate of analyses per word.