HebMorph is an open-source project that aims to make Hebrew text properly searchable by various information retrieval (IR) software libraries while maintaining good recall, precision, and relevance. It uses morphological analysis to index text at the lemma level and includes a tokenizer and analyzer implemented for the Lucene search engine library. The project seeks to improve coverage of its morphological analyzer and develop better techniques for handling out-of-vocabulary words and stop words to enhance Hebrew text search capabilities.
2. Introduction
- The requirement to control masses of information
- Manual tagging and categorization are no longer an option
- Scanning text? Using an inverted index: faster, more flexible, and ranked by relevance
- Measuring a text retrieval (TR) engine: relevance, precision, recall
- The perfect search engine is language dependent
- The perfect Hebrew search engine
- Introducing: HebMorph
Open-Source Hebrew Search: Introduction
3. How do search engines work?
- Inverted index
- Normalizations: Porter stemmer, s-stemmer, Soundex, etc.
- Stemming, so that "looking", "looked", and "looker" all match "look", and a search for "book" also returns "books"
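To make the inverted index and the stemming step concrete, here is a minimal sketch in Python. It is not HebMorph or Lucene code: the function names are hypothetical, and `s_stem` is a deliberately naive simplification of an s-stemmer (it only strips one trailing "s").

```python
from collections import defaultdict


def s_stem(token: str) -> str:
    """Naive s-stemmer (assumption: strip a single trailing 's' only)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(docs):
    """Map each normalized term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[s_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query: str):
    """Normalize the query the same way as the indexed text, then look it up."""
    return index.get(s_stem(query.lower()), [])


docs = ["I read two books", "This book is long", "No match here"]
index = build_index(docs)
hits = search(index, "book")  # matches both the "books" and the "book" document
```

The important design point is that the same normalization runs at index time and at query time; otherwise "book" and "books" would land under different keys.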
5. Token Ambiguity
- With Niqqud, Hebrew is no different from any non-Semitic language
- Niqqud-less spelling yields more than one possible meaning for almost any given word
- English: Look, Luke; Wine, Whine; Stack, Stuck
- Hebrew: [vocalized examples garbled in source]
- Niqqud-less spelling: [examples garbled in source]
Open-Source Hebrew Search: The Challenge
6. Particle Separation
- Hebrew words use particles (prefixes and suffixes) for context
- Without removing suffixes, relevant words might be skipped (example missing in source)
- Without removing prefixes, relevant words will not be looked up at all
- Ambiguity makes affix removal impossible in many cases [Hebrew examples garbled in source]
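The prefix ambiguity can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the example words are not the ones from the slide (those did not survive extraction), `LEXICON` is a toy word list, and real Hebrew prefix combinations follow grammatical rules that this sketch ignores.

```python
# The seven Hebrew letters that can act as prefix particles.
PREFIX_LETTERS = set("משהוכלב")

# Toy lexicon (hypothetical): "Saturday", "daughter", "house".
LEXICON = {"שבת", "בת", "בית"}


def candidate_stems(word, lexicon, max_prefix_len=3):
    """Return every lexicon entry reachable by stripping 0..max_prefix_len prefix letters."""
    candidates = []
    longest_cut = min(max_prefix_len, len(word) - 1)
    for cut in range(longest_cut + 1):
        prefix, rest = word[:cut], word[cut:]
        if all(ch in PREFIX_LETTERS for ch in prefix) and rest in lexicon:
            candidates.append(rest)
    return candidates


unambiguous = candidate_stems("הבית", LEXICON)  # "the house": one reading
ambiguous = candidate_stems("שבת", LEXICON)     # "Saturday", or prefix + "daughter"
```

When more than one candidate comes back, stripping alone cannot decide which reading was intended, which is the point the slide makes about affix removal.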
7. Spelling Rules?
- There is no commonly agreed set of rules for Niqqud-less spelling like the one that exists for diacriticized Hebrew
- Even agreed-upon spellings aren't always widely used
- Did you know the correct spelling for "mother" is [Hebrew word missing in source]?
- The same word can be spelled differently by different writers, or even by the same writer [Hebrew examples garbled in source]
8. !(Spelling Rules)
- Most debates are over the spelling of nouns and loanwords, which have the greatest value in IR
- An extra layer of ambiguity, where each author or user can choose the spelling they like
- [Hebrew spelling-variant pairs garbled in source]
9. Noise Reduction
- Stop-word ambiguity [Hebrew examples garbled in source]
- Stop words as collocations [Hebrew examples garbled in source]
- Collocations in which the meaning of a single word changes [Hebrew example garbled in source]
10. Tokenization Challenges
- Hebrew acronyms use the double-quote character, which most tokenizers treat as punctuation
- The same goes for Geresh, which is used for abbreviations
- Geresh is also used for [Hebrew letters garbled in source], and ambiguity again: [Hebrew example garbled in source]
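A tokenizer can keep these marks inside a token instead of splitting on them. The following Python sketch shows one possible rule set; it is an assumption for illustration, not the HebMorph tokenizer, and the sample sentence is invented.

```python
import re

# Hebrew letters alef..tav (U+05D0..U+05EA), including final forms.
HEB = "\u05D0-\u05EA"
# ASCII double quote or Hebrew gershayim (U+05F4); kept only between letters.
GERSHAYIM = "\"\u05F4"
# ASCII apostrophe or Hebrew geresh (U+05F3); kept only at the end of a word here.
GERESH = "'\u05F3"

TOKEN_RE = re.compile(
    f"[{HEB}]+(?:[{GERSHAYIM}][{HEB}]+)?[{GERESH}]?|[A-Za-z0-9]+"
)


def tokenize(text: str):
    """Return tokens, keeping acronym quotes and a trailing geresh attached."""
    return TOKEN_RE.findall(text)


# Invented sample: an acronym with an inner double quote and an abbreviation with a geresh.
tokens = tokenize("צה\"ל הודיע כי פרופ' כהן הגיע")
```

A double quote that merely wraps a word is still dropped, because the pattern only keeps it when Hebrew letters follow. A trailing apostrophe used as a closing single quote would be wrongly absorbed as a Geresh, which is exactly the kind of ambiguity the slide points out.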
11. Common Texts
- Various dialects may present OOV cases or change a meaning [Hebrew examples garbled in source], and hence require different handling
- Each corpus might hold more than one dialect
- Even partial Niqqud can help disambiguation
- Niqqud-less spelling is the most common nowadays
13. What to Index?
- Deciding on an indexing unit is the cornerstone of any well-performing search engine
- For Hebrew we have:
  - The original term (possibly with wildcards?)
  - Hebrew triliteral root
  - Lemma [Hebrew example garbled in source]
  - Pseudo-lemma, stem
- Considerations
Open-Source Hebrew Search: Ways of Resolution
14. Hebrew NLP Methods
- To analyze a Hebrew word, NLP tools have to be used:
  - Dictionary-based approach
  - Algorithmic approach
- Comparison criteria include:
  - Morphological precision (handling of 4- and 5-letter roots, broken plurals, assimilation, etc.)
  - Handling of loanwords, names, and slang
  - Toleration of spelling differences
  - Disambiguation (error rate, POS, ranking)
15. Dictionary vs Algorithm
- Dictionaries are easier to build and maintain, but they need much more ongoing attention and coverage tests
- Non-exact matches are easier to support with an algorithm
- Prerequisites and dependencies
- Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data
16. Lemma Disambiguation
- To index the correct lemma, a good disambiguation process is needed
- POS tools, grammatical or statistical, are the only reliable way to correctly eliminate false positives
- Even with such tools, ambiguity may exist [Hebrew example sentences garbled in source]
17. NLP-based Hebrew Text Retrieval
- Filter lemmas based on their rank, morphological characteristics, or statistical data
- OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)
- Removal of stop and noise words
- Term expansion (Soundex, synonyms)
- Save lemma to index (multiple lemmas at the same position)
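The steps above can be sketched as a tiny indexing pipeline in Python. This is a simplified illustration, not the HebMorph implementation: `TOY_LEMMAS`, `STOP_WORDS`, and `index_terms` are hypothetical names, and the lemma table is invented. Storing several terms at one position mirrors what Lucene allows through a position increment of zero.

```python
# Hypothetical analyzer output: surface form -> possible lemmas.
TOY_LEMMAS = {
    "הילדים": ["ילד"],       # one lemma
    "שבת": ["שבת", "בת"],    # ambiguous: both lemmas are kept
}
STOP_WORDS = {"של"}


def index_terms(tokens, analyze, stop_words=frozenset()):
    """Return (term, position) pairs; all lemmas of one token share its position."""
    postings = []
    for position, token in enumerate(tokens):
        if token in stop_words:
            continue  # dropped, but the position gap is preserved
        lemmas = analyze(token) or [token]  # OOV: keep the surface form as-is
        for lemma in lemmas:
            postings.append((lemma, position))
    return postings


postings = index_terms(["הילדים", "של", "שבת", "גוגל"], TOY_LEMMAS.get, STOP_WORDS)
```

Keeping the position gap for removed stop words and stacking ambiguous lemmas at one position means phrase and proximity queries still see the original word distances.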
18. Other Text Retrieval Methods
- Is morphological analysis necessary?
- Available methods:
  - Light stemming
  - Word truncation
  - N-grams
  - Skipgrams
  - (Sub-types)
- Require no extra overhead
- Favorable, even when not superior
- Disadvantages: larger index size, slower searches (for some)
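Two of these language-independent methods are simple enough to show in a few lines of Python. This is a generic sketch with hypothetical function names and default sizes chosen only for illustration; the Hebrew sample word ("computers") is not from the slides.

```python
def char_ngrams(word: str, n: int = 4):
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def truncate(word: str, k: int = 5):
    """Word truncation: index only the first k characters."""
    return word[:k]


grams = char_ngrams("מחשבים", 4)  # three overlapping 4-grams of a 6-letter word
head = truncate("searching")     # keeps the first five characters
```

Each word turns into several index terms under n-gram indexing, which is where the larger index size listed among the disadvantages comes from.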
19. Applied to Semitic Languages
- Research has shown 4-grams and light stemmers (light-10) to work better than morphological lemmatizers for Arabic
- Apparently, good relevance can be achieved without knowing the language
- Computers vs Humans
- Lemmatization and disambiguation processes do make mistakes
- Contextual processing can fail for short queries, producing incorrect searches
20. The Best Retrieval Method for Hebrew Texts
- Arabic and Hebrew share many morphological phenomena, but they do differ
- Without trying, we can never know
- Where HebMorph comes in
22. HebMorph is a free, open-source effort to make Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision, and relevance in retrievals.
- 2 goals
- Development is done with Lucene (why?)
- MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
- OpenRelevance
Open-Source Hebrew Search: HebMorph's Approach
24. Searching Wikipedia with BzReader and HebMorph
- Source available from http://github.com/synhershko/BzReader
25. The Road Ahead
- A better tokenizer
- MorphAnalyzer:
  - Hspell improvements (coverage, lemma probabilities, prefix probabilities)
  - Toleration guidelines
  - Smarter OOV handling
- Better stop-word handling
- Hebrew judgments for OpenRelevance with Orev
- Comparing various approaches to Hebrew IR
- Wide availability (Java port underway!)
- Other uses (NLP, OCR, you name it)
26. Join Us!
- The more people join, the more feedback we get, and the better we become.
- Our mailing list: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
- Code repository (released under GPLv2): http://github.com/synhershko/HebMorph
- Activity updates: http://www.code972.com/blog/hebmorph/
- #HebMorph on Twitter