This document describes building a Spanish version of MetaMap, a tool for extracting biomedical concepts from text, using automatic translation and biomedical ontologies. The researchers translated Spanish medical texts into English using Google Translate, then applied the English MetaMap tool to the translations. They evaluated this approach on a bilingual text collection and found that classification performance was comparable to running MetaMap directly on the original English texts. This "easy way" of reusing resources across languages through translation is a promising approach for adapting existing natural language processing tools.
Presentation at IDEAL 2008
1. Building a Spanish MMTx by
using Automatic Translation and
Biomedical Ontologies
Francisco Carrero 1,2 ; José Carlos Cortizo 1,2 ; José Mª Gómez 3
1 Wipley, Social Gaming Platform
http://www.wipley.com
2 Universidad Europea de Madrid
http://www.esp.uem.es/gsi
3 Optenet
http://www.esp.uem.es/gsi
2. Outline
The MIRCAT project
The challenge
English MetaMap, a big effort
Approaching a Spanish MetaMap
Experiments
Discussion of the Results and Future Work
Francisco Carrero Garcia
6. The Challenge
The problem
We can extract UMLS concepts from English texts using
MetaMap...
...but there is no Spanish version of MetaMap
Is it difficult to construct a tool like MetaMap?
10. Experimental Design
Text Collections
MedlinePlus medical news
http://www.nlm.nih.gov/medlineplus/newsbydate.html
Excellent online resource
2,000 news items, some in English, some in Spanish
600 available in both languages
11. Experiments
Experimental Design
MetaMap extracts concepts, allowing multiple representations
A => Using compound concepts
B => simple concepts
1 => resolves ambiguity by adding all the concepts
2 => ignores ambiguities by choosing the first possibility
4 representations: A1, A2, B1, B2
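The four representations can be illustrated with a minimal Python sketch. The slides do not show MetaMap's output format, so the `Candidate` structure, its field names, and the concept identifiers below are hypothetical. The sketch only assumes that each phrase yields an ordered list of candidate mappings, each carrying one compound concept and its simple constituent concepts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One hypothetical candidate mapping for a phrase (not MetaMap's real output format)."""
    compound: str        # identifier of the compound concept
    simple: List[str]    # identifiers of its simple constituent concepts


def represent(phrases: List[List[Candidate]], compound: bool, keep_all: bool) -> List[str]:
    """Flatten per-phrase candidates into a list of concept features.

    compound=True -> A (compound concepts); False -> B (simple concepts)
    keep_all=True -> 1 (add every candidate); False -> 2 (first candidate only)
    """
    features: List[str] = []
    for candidates in phrases:
        chosen = candidates if keep_all else candidates[:1]
        for cand in chosen:
            features.extend([cand.compound] if compound else cand.simple)
    return features


# Made-up example: one ambiguous phrase with two candidate mappings.
doc = [[Candidate("C_heart_attack", ["C_heart", "C_attack"]),
        Candidate("C_cardiac_arrest", ["C_cardiac", "C_arrest"])]]

a1 = represent(doc, compound=True, keep_all=True)
a2 = represent(doc, compound=True, keep_all=False)
b1 = represent(doc, compound=False, keep_all=True)
b2 = represent(doc, compound=False, keep_all=False)
```

Under these assumptions, A1 keeps both compound concepts, A2 keeps only the first, and B1 and B2 do the same with the simple concepts.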
12. Experiments
Filtering
Data representations containing a lot of features do not usually
perform very well in text tasks
Many classifiers degrade in prediction accuracy when faced with
many irrelevant features or redundant/correlated ones ("curse
of dimensionality")
We apply Zipf's Law to filter the attributes
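The slides do not say how Zipf's Law is turned into a filter. One common reading, assumed here, is that a few features are very frequent and most are very rare, so features whose document frequency falls outside a chosen band are dropped. The function name and the thresholds `min_df` and `max_df_ratio` are illustrative, not the authors' settings.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequency_band_filter(docs: List[Iterable[str]], min_df: int = 2,
                          max_df_ratio: float = 0.9) -> Set[str]:
    """Return the features whose document frequency lies within the band."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))  # count each feature once per document
    max_df = max_df_ratio * len(docs)
    return {feat for feat, n in df.items() if min_df <= n <= max_df}


# Made-up concept features for four documents.
docs = [
    ["C_common", "C_flu", "C_rare1"],
    ["C_common", "C_flu"],
    ["C_common", "C_asthma"],
    ["C_common", "C_asthma", "C_rare2"],
]
kept = frequency_band_filter(docs, min_df=2, max_df_ratio=0.9)
```

Here the feature that appears in every document and the two that appear only once are discarded, which leaves the mid-frequency features.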
16. Discussion of the Results
Translation
The worst similarity results are obtained with the most complex
representation (the one closest to human interpretation): A1
B1 is less complex and produces the best results
=> Our model seems to be more suitable as a plain bag-of-
concepts representation
Similar to bag-of-words representation, widely used in text
processing tasks
17. Discussion of the Results
Classification
All results are comparable to classification on original English
texts
In some cases, they are even better
Best results using A2+Zipf, +7.8% in AUC
UNMKD representations never achieve worse classifications than
English
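For reference, AUC (area under the ROC curve) is the probability that a classifier scores a randomly chosen positive example above a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are made up and are not results from these experiments.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
value = auc(labels, scores)
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.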
18. Conclusions and Future Work
The "easy way" to construct a Spanish MetaMap is promising
Google Translate seems a good tool for adapting English resources
to other languages (such as Spanish)
We should try other translation tools
We are working on applying this approach to other text tasks
(like Information Retrieval and Filtering)
19. Ending...
Thank you very much for your attention