A Lexical Database for Modern Standard
Arabic Interoperable with a Finite State
      Morphological Transducer

   Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia
            Tounsi, Josef van Genabith

       National Centre for Language Technology (NCLT),
          School of Computing, Dublin City University

                             Funded by:
      Enterprise Ireland, the Irish Research Council for Science,
            Engineering and Technology (IRCSET), and
           the EU projects PANACEA and META-NET
Contribution
 We develop a finite state morphological
  transducer for Modern Standard Arabic
  1. Open source, distributed under the GPLv3 license
  2. Large scale, more than 30,000 lemmas
  3. Corpus based, truly representative of Modern
     Standard Arabic and not Classical Arabic.
  4. Compatible with Foma, an open-source FST compiler
Short Tutorial
(1) Download Foma
     http://foma.sourceforge.net

(2) Download AraComLex
     http://aracomlex.sourceforge.net

(3) Build the transducer by following the README
The transducer online
 You can test the transducer online:
    http://www.cngl.ie/aracomlex
Introduction
 Modern Standard Arabic vs. Classical
  Arabic
 Current State of Arabic Lexicography
   Lexicons are not corpus-based
   Buckwalter Arabic Morphological Analyser
 Importance of Lexical Resources
Introduction
   Arabic Morphotactics
   [Figure: diagram with the labels "Pattern" and "lemma"]
Aim
 Constructing a lexical database for Modern
  Standard Arabic
 Building a finite-state morphological
  transducer
Methodology
 Using a medium-scale manually created lexicon
  of 10,799 lemmas, with detailed info for:
    Nouns (human/nonhuman, POS, Continuation Classes)
    Verbs (transitive/intransitive, allow passive, allow
     imperative)
 Using statistics from a 1-billion-word corpus
    90% from the LDC's Arabic Gigaword
    10% collected from the Al-Jazeera website
 Using a pre-annotation tool: MADA+TOKAN
Methodology
 Using Finite State Technology (XFST)
   Bidirectional: suitable for analysis and generation
   Handles concatenative and non-concatenative
    morphotactics
   Speed and efficiency in dealing with millions of
    paths
   Handles separated dependencies.
   Handles phonological and orthographic changes
    through alteration rules.
Methodology
 Design Approach:
  Three approaches
   Root-based morphology
     Xerox Arabic FTM
   Stem-based morphology
     Buckwalter
     $kr   $akar   PV   thank;give thanks
     $kr   $okur   IV   thank;give thanks
   Lemma-based morphology
Methodology
Our Approach:
Lemma-based morphology
Methodology
Alteration Rules:
Alteration Rules are used for handling discrepancies
between surface forms and underlying representation or
lemmas. We have 130 replace rules.
      a -> b || L _ R
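
The notation a -> b || L _ R reads: rewrite a as b when it occurs between the left context L and the right context R. The idea can be sketched in Python with a regular-expression lookbehind and lookahead standing in for a compiled transducer. The helper name apply_rule and the toy rule are assumptions made for illustration; they are not taken from the 130 AraComLex rules, and a regex substitution only approximates how a finite-state compiler applies a rule.

```python
import re

def apply_rule(text, a, b, left="", right=""):
    """Rewrite `a` as `b` only between the literal contexts `left` and `right`.

    Mimics the shape of the replace rule  a -> b || L _ R  with literal strings.
    Hypothetical helper for illustration; real rules are compiled into the FST.
    """
    pattern = (
        (f"(?<={re.escape(left)})" if left else "")   # left context L
        + re.escape(a)                                 # the symbol to rewrite
        + (f"(?={re.escape(right)})" if right else "") # right context R
    )
    # A function replacement keeps `b` literal (no backslash interpretation).
    return re.sub(pattern, lambda m: b, text)

# Toy rule: n -> m || _ b   (rewrite "n" as "m" when followed by "b")
print(apply_rule("inbalance", "n", "m", right="b"))  # prints "imbalance"
```

Only the "n" that stands directly before "b" is rewritten; the second "n" is left alone because its right context does not match.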
Results to Date
 Start off with a seed lexicon
   Four lexical databases, manually constructed
        5,925 nominal lemmas
        1,529 verb lemmas
        490 patterns (456 for nominals and 34 for verbs)
        Lemma-root lookup database
Results to Date
 Automatically Extending the Lexical
  Database: Lexical Enrichment
   Data-driven filtering technique
        40,648 lemmas (in Buckwalter or SAMA 3.1)
        Statistics from three web search engines
        Statistics from the corpus annotated by MADA
        29,627 lemmas (left after filtering)
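
The slide does not state the exact filtering criterion, so the following is only a sketch under an assumption: a lemma is kept when its frequency in the annotated corpus or its web hit count reaches a threshold. The class LemmaStats, the function keep, the thresholds, and the toy counts are all hypothetical. The last line only restates the share of lemmas that the slide reports as surviving the filter.

```python
from dataclasses import dataclass

@dataclass
class LemmaStats:
    lemma: str        # lemma identifier (placeholder names below)
    corpus_freq: int  # occurrences in the MADA-annotated corpus
    web_hits: int     # combined hit count from the web search engines

def keep(stats, min_corpus=1, min_web=1000):
    """Assumed rule: keep a lemma attested in the corpus or common on the web."""
    return stats.corpus_freq >= min_corpus or stats.web_hits >= min_web

candidates = [
    LemmaStats("lemma_a", corpus_freq=5200, web_hits=900000),  # toy numbers
    LemmaStats("lemma_b", corpus_freq=0, web_hits=12),         # toy numbers
]
kept = [s.lemma for s in candidates if keep(s)]
print(kept)  # prints ['lemma_a']

# Share of candidate lemmas left after filtering, from the figures on the slide.
print(f"{29627 / 40648:.1%}")  # prints 72.9%
```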
Results to Date
Automatically Extending the Lexical
Database: Feature Enrichment
   Machine Learning
   Multilayer Perceptron classification algorithm
   Training Data: 4,816 nominals and 1,448 verbs
   Classes for nominals: continuation classes (or inflection
    paths), the semantico-grammatical feature of humanness,
    and POS (noun or adjective)
   Classes for verbs: transitivity, allowing the passive voice,
    and allowing the imperative mood
   We feed these datasets with frequency statistics from the
    corpus and build a vector grid.
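
How the frequency statistics become classifier input is not spelled out on the slide. One plausible reading, sketched below, is that each lemma's raw counts are normalised into a row of relative frequencies, and the rows together form the grid passed to the classifier. The feature names in FEATURES, the function to_vector, and the toy counts are assumptions for illustration; the classifier itself is not shown.

```python
FEATURES = ["passive", "imperative", "object_clitic"]  # hypothetical feature set

def to_vector(tag_counts, features=FEATURES):
    """Turn raw tag counts for one lemma into relative frequencies."""
    total = sum(tag_counts.values())
    if total == 0:
        # A lemma never seen in the corpus gets an all-zero row.
        return [0.0] * len(features)
    return [tag_counts.get(f, 0) / total for f in features]

# Toy counts for one verb lemma; "other" covers all remaining occurrences.
counts = {"passive": 20, "imperative": 5, "object_clitic": 50, "other": 25}
print(to_vector(counts))  # prints [0.2, 0.05, 0.5]
```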
Results to Date
 Extending the Lexical Database
   Feature enrichment using Machine Learning
Results to Date
 Extending the Lexical Database
   With Machine Learning we add:
     About 18,000 new lemmas (18,008 in total):
       12,974 nominals
       5,034 verbs
Results to Date
 AraComLex Lexicon Writing Application
Results to Date
 FST Morphology Coverage and RPW (rate of
  analyses per word) Results
   A test corpus of 800,000 words, divided as:
      400,000 for semi-literary texts
      400,000 for general news texts
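
The two measures can be computed from any analyser that returns a list of analyses per word. The sketch below assumes that coverage is the share of tokens with at least one analysis and that the rate is averaged over the analysed tokens only; the slide does not give the exact definitions. The function coverage_and_rate and the toy lexicon are hypothetical stand-ins for the transducer and the test corpus.

```python
def coverage_and_rate(tokens, analyse):
    """Return (coverage, analyses per analysed word) for a list of tokens."""
    counts = [len(analyse(t)) for t in tokens]
    analysed = [c for c in counts if c > 0]
    coverage = len(analysed) / len(counts) if counts else 0.0
    rate = sum(analysed) / len(analysed) if analysed else 0.0
    return coverage, rate

# Toy analyser: a dictionary lookup standing in for the transducer.
toy = {"w1": ["a1", "a2"], "w2": ["a1"], "w3": []}
cov, rate = coverage_and_rate(["w1", "w2", "w3", "w1"], lambda t: toy.get(t, []))
print(f"{cov:.2f} {rate:.2f}")  # prints 0.75 1.67
```

A lower rate at comparable coverage means the analyser returns fewer competing analyses for the words it recognises.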
Future Work
 Going beyond SAMA
 Including Named Entities and MWEs
 Building a spell checker
Conclusion
 Open-source finite state transducer for Modern
  Standard Arabic (AraComLex) distributed under
  the GPLv3 license.
 We successfully use machine learning to predict
  morpho-syntactic features for newly acquired
  words.
 Comparing our morphological transducer to
  SAMA, we find that we achieve comparable
  coverage and a lower rate of analyses per word.
