This document describes three techniques for automatically extracting multiword expressions (MWEs) from Arabic texts:
1. Using crosslingual correspondences between Arabic Wikipedia titles and their translations in other languages, assuming MWEs are less likely to have one-to-one translations.
2. Translating nominal MWEs from Princeton WordNet into Arabic using Google Translate and validating using search engine frequency counts.
3. Applying association measures like PMI and chi-square to n-grams in the Arabic Gigaword corpus after lemmatization and POS filtering.
Combining techniques that draw on multilingual data, dictionaries and corpora enriched the extracted Arabic MWE lexicon, which contains over 33,000 MWEs and 39,000 named entities (NEs).
1. Automatic Extraction of Arabic Multiword Expressions
Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and Josef van Genabith
School of Computing, Dublin City University, Ireland
2. Outline
• Introduction
• Data Resources
• Methodology
  • Crosslingual Correspondence Asymmetries
  • Translation-Based Approach
  • Corpus-Based Approach
• Discussion of experiments and results
• Conclusion
3. Introduction
• Criteria of MWEs
  • Ubiquity
  • Diversity
  • Low polysemy
  • Statistically significant co-occurrence
• Focus
  • Arabic
  • Nominal MWEs
• Purpose: building an MWE lexicon for Arabic
4. Data Resources
• Multilingual, bilingual and monolingual settings
• Availability of rich resources that have not been exploited in similar tasks before
• Arabic Wikipedia (March 2010)
  • 117,491 titles, of which 89,623 are multiword titles
  • Arabic is ranked 27th by size (article count) and 17th by usage
  • Information helpful for linguistic processing
5. Data Resources
• Princeton WordNet 3.0
  • An electronic lexical database for English
  • Arabic WordNet contains only 11,269 synsets (including 2,348 MWEs)
6. Data Resources
• Arabic Gigaword
  • Unannotated corpus distributed by the Linguistic Data Consortium (LDC).
  • Articles from news agencies and newspapers from different Arab regions, such as Al-Ahram in Egypt, An Nahar in Lebanon and Assabah in Tunisia.
  • Largest publicly available corpus of Arabic to date.
  • Contains 848 million words.
7. Methodology
• 3 different techniques for 3 different data sources
• Motivation for using different techniques
  • The extraction of MWEs is a problem more complex than can be dealt with by one simple solution.
  • The choice of technique depends on the nature of the task and the type of resources used.
9. Technique 1: Crosslingual Asymmetries
• Data: titles of Wikipedia articles in Arabic and the corresponding titles in 21 languages.
• Definition: we rely on many-to-one correspondence relations.
• The non-compositionality of MWEs makes it unlikely that they have a mirrored representation in other languages.
• Compositionality varies:
  • highly compositional, such as the Arabic expression for "military base";
  • with a degree of idiomaticity, such as the expression for "amusement park", lit. "city of amusements";
  • extremely opaque, such as the expression for "grasshopper", lit. "the horse of the Prophet".
10. Technique 1: Crosslingual Asymmetries
• Steps
  (1) Candidate selection. All Arabic Wikipedia multiword titles are taken as candidates.
  (2) Filtering. We exclude titles of disambiguation and administrative pages.
  (3) Validation. We check whether there is a single-word translation in any of the 21 selected languages.
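The validation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace-based definition of "multiword", and the placeholder titles are all assumptions made for the example.

```python
def is_multiword(title):
    """Assumption: a title is multiword if it has more than one whitespace-separated token."""
    return len(title.split()) > 1


def validate_by_asymmetry(candidates):
    """Map each multiword Arabic title to True when at least one linked title
    in another language is a single word (a many-to-one correspondence)."""
    decisions = {}
    for arabic_title, translations in candidates.items():
        if not is_multiword(arabic_title):
            continue  # only multiword titles are candidates
        decisions[arabic_title] = any(
            not is_multiword(t)
            for t in translations.values()
            if t.strip()  # ignore missing or empty links
        )
    return decisions


# Hypothetical data: placeholder Arabic titles and invented interlanguage links.
sample = {
    "tokA tokB": {"en": "Grasshopper", "fr": "Sauterelle"},
    "tokC tokD": {"en": "Military base", "fr": "Base militaire"},
    "tokE": {"en": "City"},
}
result = validate_by_asymmetry(sample)
```

In this sketch a single one-word translation is enough to accept a candidate, which mirrors the "any of the 21 selected languages" wording of step (3).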
11. Technique 1: Crosslingual Asymmetries
• Evaluation:
  • 1,100 multiword titles are randomly selected from Arabic Wikipedia and manually tagged as MWEs, non-MWEs, or named entities (NEs).
  • Baseline: all multiword titles are considered MWEs.
• Results
14. Technique 2: Translation-Based
• Data: Princeton WordNet (PWN)
• Assumption: MWEs in one language are likely to be translated as MWEs in another language.
• Ontological advantage
• Steps
  • Extracting the list of nominal MWEs from PWN 3.0.
  • Translating the list into Arabic using Google Translate.
  • Validating the results using pure frequency counts from three search engines: Al-Jazeera, BBC Arabic and Arabic Wikipedia (AWK).
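The validation step can be illustrated with a small filter over combined hit counts. This is only a sketch under stated assumptions: the hit counts are invented, the engine keys and the `min_hits` cut-off are hypothetical, and no real search engine is queried.

```python
def combined_hits(hits_by_engine):
    """Sum the raw frequency counts returned by the individual search engines."""
    return sum(hits_by_engine.values())


def filter_translations(candidates, min_hits=1):
    """Keep candidate translations whose combined hit count reaches min_hits.

    min_hits is a hypothetical parameter; the slides do not state the cut-off used.
    """
    return [
        translation
        for translation, hits in candidates.items()
        if combined_hits(hits) >= min_hits
    ]


# Hypothetical hit counts for three machine-translated candidates.
candidates = {
    "translation_1": {"aljazeera": 12, "bbc_arabic": 3, "awk": 40},
    "translation_2": {"aljazeera": 0, "bbc_arabic": 0, "awk": 0},
    "translation_3": {"aljazeera": 0, "bbc_arabic": 1, "awk": 0},
}
kept = filter_translations(candidates, min_hits=1)
strict = filter_translations(candidates, min_hits=5)
```

Raising the cut-off trades recall for precision: a candidate that is attested only once survives the lenient filter but not the strict one.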
15. Technique 2: Translation-Based
• Evaluation (automatic)
  • Gold standard: PWN MWEs that are found in English Wikipedia and have a correspondence in Arabic (6,322 expressions).
  • We test the Google translation without any filtering and take this as the baseline.
  • We then filter the output based on the number of combined hits from the search engines.
• Results
16. Technique 2: Translation-Based
• Evaluation (manual)
  • On 200 MWE candidates
  • Precision
    • Baseline (before validation): 45.5%
    • After validation: 83%
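As a quick arithmetic check on these figures: if both precision values were computed over the full sample of 200 candidates (an assumption; the slide does not say so), 45.5% corresponds to 91 correct candidates and 83% to 166. The sketch below computes precision from a list of manual judgements; the judgement lists are synthetic and only reproduce those counts.

```python
def precision(judgements):
    """Fraction of candidates judged correct; judgements is a list of booleans."""
    if not judgements:
        raise ValueError("no judgements given")
    return sum(judgements) / len(judgements)


# Synthetic judgements, assuming both figures refer to all 200 candidates.
baseline_judgements = [True] * 91 + [False] * 109
validated_judgements = [True] * 166 + [False] * 34

baseline_precision = precision(baseline_judgements)
validated_precision = precision(validated_judgements)
```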
17. Technique 2: Translation-Based
• Notes on Google Translate
  • Word order, for example in the translations of "shark repellent" and "accordion door"
  • Transferring the source word to the target
    • acroclinium roseum => acroclinium roseum
    • actitis hypoleucos => actitis hypoleucos
18. Technique 3: Corpus-Based
• Data: Arabic Gigaword corpus
• Association measures used:
  • Pointwise Mutual Information (PMI)
  • Pearson's chi-square
• Steps
  (1) Computing the frequency of all unigrams, bigrams and trigrams.
  (2) Computing the association measures for all bigrams and trigrams (threshold set to 50).
  (3) Ranking the bigrams and trigrams.
  (4) Lemmatizing the Arabic words using MADA.
  (5) Filtering the list using the MADA POS tagger. The patterns included are NN and NA for bigrams, and NNN, NNA and NAA for trigrams.
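The scoring and filtering steps can be sketched for bigrams as below. This is an illustrative sketch, not the authors' code: it assumes the threshold of 50 is a minimum bigram frequency (exposed here as the hypothetical `min_freq` parameter), uses the standard 2x2 contingency-table form of Pearson's chi-square, and represents POS tags as single letters so that a tag sequence can be matched against the patterns listed in step (5).

```python
import math
from collections import Counter

BIGRAM_PATTERNS = {"NN", "NA"}
TRIGRAM_PATTERNS = {"NNN", "NNA", "NAA"}


def bigram_scores(tokens, min_freq=50):
    """Return {bigram: (pmi, chi_square)} for bigrams seen at least min_freq times."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())  # total number of bigram positions
    first = Counter()   # how often each word opens a bigram
    second = Counter()  # how often each word closes a bigram
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        if o11 < min_freq:
            continue
        o12 = first[w1] - o11        # w1 followed by another word
        o21 = second[w2] - o11       # another word followed by w2
        o22 = n - o11 - o12 - o21    # neither w1 first nor w2 second
        pmi = math.log2(o11 * n / (first[w1] * second[w2]))
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(w1, w2)] = (pmi, chi2)
    return scores


def matches_pos_pattern(tags):
    """True if the tag sequence (e.g. ['N', 'A']) is one of the allowed patterns."""
    pattern = "".join(tags)
    return pattern in BIGRAM_PATTERNS or pattern in TRIGRAM_PATTERNS


# Toy corpus with placeholder tokens; min_freq is lowered so something survives.
toy_tokens = "a b a b c d".split()
toy_scores = bigram_scores(toy_tokens, min_freq=2)
ranked = sorted(toy_scores, key=lambda bg: toy_scores[bg][1], reverse=True)
```

Ranking by either score and then keeping only candidates whose tag sequence passes `matches_pos_pattern` corresponds to steps (3) and (5); trigrams would need an analogous scoring function, which is omitted here.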
19. Technique 3: Corpus-Based
• Why is lemmatization important?
  • Al>mm AlmtHdp
    (the-nations united) "the United Nations"
    Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#
  • ll>mm AlmtHdp
    (to-the-nations united) "to the United Nations"
    ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
  • wAl>mm AlmtHdp
    (and-the-nations united) "and the United Nations"
    wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#
  • bAl>mm AlmtHdp
    (by-the-nations united) "by the United Nations"
    bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
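The effect shown in these four examples can be reproduced with a few lines of parsing. The sketch assumes, from the look of the analysis strings on the slide, that each word is encoded as `surface@lemma@POS@number` and terminated by `#`; this reading of the format and the function names are assumptions, not documented MADA behaviour.

```python
def parse_analysis(line):
    """Split an analysis string into (surface, lemma, pos) triples.

    Assumed layout per word: surface@lemma@POS@number, terminated by '#'.
    """
    triples = []
    for chunk in line.strip().split("#"):
        if not chunk:
            continue  # skip the empty piece after the final '#'
        surface, lemma, pos, _number = chunk.split("@")
        triples.append((surface, lemma, pos))
    return triples


# The four analysis strings as they appear on the slide.
analyses = [
    "Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#",
    "ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
]

surface_bigrams = {
    tuple(surface for surface, _, _ in parse_analysis(a)) for a in analyses
}
lemma_bigrams = {
    tuple(lemma for _, lemma, _ in parse_analysis(a)) for a in analyses
}
```

Counted by surface form, the four variants are four different bigrams; counted by lemma, they collapse into a single bigram, so their frequencies are pooled before the association measures are applied.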
20. Technique 3: Corpus-Based
• Evaluation: 3,600 expressions are randomly selected and classified as MWE or non-MWE by a human annotator.
• Results
22. Discussion of results
• Similarities and dissimilarities of output
  • The set of collocations detected by the association measures may differ from those that capture the interest of lexicographers and Wikipedians. Examples (English glosses):
    • "Menachem Mazuz"
    • "fresh fruits"
    • "Ladies and gentlemen"
23. Conclusion
• Applicability to other languages
• The heterogeneity of the data sources helps to enrich the MWE lexicon.
• A lexical resource of:
  • 33,000 MWEs
  • 39,000 NEs