
Open-Source Hebrew Search
Itamar Syn-Hershko
SIGTRS Meetup, 22/7/2010, Jerusalem

Introduction
The requirement to control masses of information
Manual tagging / categorization is no longer an option
Scanning text? Using an inverted index is faster, more flexible, and allows ranking by relevance
Measuring a text retrieval (TR) engine: relevance, precision, recall
The perfect search engine is language dependent
The perfect Hebrew search engine
Introducing: HebMorph
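The precision and recall measures mentioned on this slide can be sketched in a few lines. This is a minimal illustration using the standard set-based definitions; the function name and the sample document IDs are hypothetical and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    retrieved: document IDs the engine returned
    relevant:  document IDs a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: how much of what was returned is relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: how much of what is relevant was returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 documents are relevant.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
print(p, r)
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).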

How do search engines work?
Inverted index
Normalizations: Porter stemmer, s-stemmer, Soundex etc.
Stemming, so that "looking", "looked" and "looker" all equal "look", and a search for "book" will also return "books".
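As a rough illustration of the two ideas on this slide, the sketch below builds a tiny inverted index and normalizes terms with a crude plural-stripping stemmer. The stemming rules, function names and sample documents are assumptions made for the example; this is not the Porter algorithm and not any particular engine's implementation.

```python
from collections import defaultdict


def s_stem(word):
    """Crude, hypothetical s-stemmer: strips common English plural endings."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def build_index(docs):
    """Map each normalized term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[s_stem(token.strip('.,;:!?"'))].add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way as the documents, then look it up."""
    return sorted(index.get(s_stem(query), set()))


docs = {
    1: "A book about search",
    2: "Two books on indexing",
    3: "Looking at glass",
}
index = build_index(docs)
print(search(index, "book"))  # finds both "book" and "books"
```

The key design point is that the same normalization runs at index time and at query time, so "book" and "books" meet at one index entry.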

The Challenge

Tokens Ambiguity
With Niqqud, Hebrew is no different from a non-Semitic language
Niqqud-less spelling yields more than one possible meaning for almost any given word
English: Look, Luke; Wine, Whine; Stack, Stuck.
Hebrew: שָנִי, שֵנִי, שְנֵי, שֹנִי, שְנִי
Niqqud-less spelling: שני, שני, שני, שני, שני …

Particles Separation
Hebrew words use attached particles for context
Without removing suffixes, relevant words might be skipped (for example: חבלה)
Without removing prefixes, relevant words will not be looked up at all
Ambiguity makes affix removal impossible in many cases
בית -> הבית, בבית, שבבית, לבית, והבית ...
הרכבת -> רותי פספסה את ה רכבת ("Ruti missed the train")
הרכבת -> הרכבת המוצר מסובכת להפליא ("assembling the product is remarkably complicated")
כלבי -> ?
שבתו -> ?
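The prefix ambiguity described above can be sketched as follows: every leading run of prefix letters yields a candidate split, and a lexicon lookup decides which splits survive. The prefix letter set, the length limits and the three-word lexicon are simplifying assumptions for illustration only; real analyzers also check which prefix combinations are grammatical.

```python
# Letters that can act as attached particles (simplified assumption).
PREFIX_LETTERS = set("משהוכלב")

# Hypothetical toy lexicon; a real system would use a full dictionary.
LEXICON = {"בית", "רכבת", "הרכבת"}


def candidate_splits(word, max_prefix=3, min_stem=2):
    """Return (prefix, remainder) pairs, starting with the unsplit word."""
    splits = [("", word)]
    for i in range(1, min(max_prefix, len(word) - min_stem) + 1):
        if word[i - 1] not in PREFIX_LETTERS:
            break  # the prefix run ended, no longer prefixes are possible
        splits.append((word[:i], word[i:]))
    return splits


def analyses(word):
    """Keep only the splits whose remainder is a known word."""
    return [(p, s) for p, s in candidate_splits(word) if s in LEXICON]


print(analyses("שבבית"))   # one reading: prefixes + בית
print(analyses("הרכבת"))   # two readings: the whole word, or ה + רכבת
```

With this toy lexicon, הרכבת keeps two analyses, which is exactly the case where affix removal alone cannot decide and a disambiguation step is needed.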

Spelling Rules?
There is no common agreement on rules for Niqqud-less spelling like the one that exists for diacriticized Hebrew
Even agreed-upon spellings are not always widely used
Did you know the correct spelling for "mother" is "אימא"?
The same word can be spelled differently by different writers, or even by the same writer
שירות / שרות / שיירות
דוגמא / דוגמה

!(Spelling Rules)
Most debates are over the spelling of nouns and loanwords, which have the greatest value in IR
An extra layer of ambiguity, where each author or user can choose the spelling they prefer
אחשורוש or אחשוורוש?
שבדיה or שוודיה?
טורקיה or תורכיה?
פריס or פריז? Or maybe פאריז?

Noise Reduction
Stop words ambiguity: אשר, כדי, אף ...
Stop words as collocations: על ידי, אי פעם, אף על פי, שום דבר ...
Collocations in which the meaning of a single word changes: פי התהום

Tokenization Challenges
Hebrew acronyms use the double-quote character, which most tokenizers treat as punctuation
Same with Geresh, which is used for abbreviations
Geresh is also used for חצ"ץ ג"ז …
and ambiguity again: אינצ'
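One way to approach the problem above is a tokenizer pattern that keeps a quote mark when it sits between Hebrew letters and drops it otherwise. The sketch below is a minimal regex-based illustration, not HebMorph's tokenizer; the pattern and names are assumptions, and it deliberately ignores Niqqud and many edge cases.

```python
import re

HEB = "\u05D0-\u05EA"          # Hebrew letters alef..tav, including final forms
GERESH = "'\u05F3"             # ASCII apostrophe and Hebrew geresh
GERSHAYIM = "\"\u05F4"         # ASCII double quote and Hebrew gershayim
INNER = GERESH + GERSHAYIM

# A Hebrew token may contain quote marks between letters and may end with a
# geresh; anything else falls back to plain ASCII alphanumerics.
TOKEN_RE = re.compile(
    f"[{HEB}]+(?:[{INNER}][{HEB}]+)*[{GERESH}]?|[A-Za-z0-9]+"
)


def tokenize(text):
    return TOKEN_RE.findall(text)


print(tokenize('חצ"ץ'))              # acronym kept as one token
print(tokenize('אמר "שלום" והלך'))   # real quotation marks are dropped
print(tokenize("אינצ'"))             # trailing geresh kept
```

The last case shows the ambiguity the slide points at: a trailing apostrophe is kept as a geresh here, but the same character could just as well be a closing single quote, and the pattern alone cannot tell the two apart.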

Common Texts
Various dialects may present OOV cases or change a meaning (חמר, חמרא), and hence require different handling
Each corpus might hold more than one dialect
Even partial Niqqud can help disambiguation
Niqqud-less spelling is the most common nowadays

Ways of Resolution

What to Index?
Deciding on an "indexing unit" is the cornerstone of any well-performing search engine
For Hebrew we have:
The original term (and possibly using wildcards?)
Hebrew triliteral root
Lemma (דלת ← דלתותינו)
Pseudo-lemma, Stem
Considerations

Hebrew NLP Methods
To analyze a Hebrew word, NLP tools have to be used:
Dictionary-based approach
Algorithmic approach
Comparison criteria include:
Morphological precision (handling of 4-5 roots, broken plurals, assimilation, etc.)
Handling of loanwords, names and slang
Toleration of spelling differences
Disambiguation (error rate, POS, ranking)

Dictionary vs Algorithm
Dictionaries are easier to build and maintain, but they need much more ongoing attention and coverage tests
Easier to support non-exact matches with an algorithm
Prerequisites and dependencies
Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data

Lemma Disambiguation
In order to index a correct lemma, a good disambiguation process needs to be used
POS tools, grammatical or statistical, are the only reliable way to correctly eliminate false positives
Even with such tools, ambiguity may exist:
"המראה של מטוסים ריקים [...]"
"ראש הממשלה בבון"

NLP-based Hebrew Text Retrieval
Filter lemmas based on their rank, morphological characteristics or statistical data
OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)
Removal of stop and noise words
Term expansion (Soundex, synonyms)
Save lemma to index (multiple lemmas at the same position)
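The last step, storing several lemmas at the same position, can be sketched with a small positional index. This is an illustrative model only: the function name and the hand-written candidate lemmas are assumptions, and the idea is similar in spirit to emitting tokens with a zero position increment in Lucene.

```python
from collections import defaultdict


def index_tokens(analyzed_tokens):
    """Build lemma -> positions postings.

    analyzed_tokens: one list of candidate lemmas per token in the text.
    All candidates for a token share that token's position, so phrase and
    proximity queries still work whichever reading the user meant.
    """
    postings = defaultdict(list)
    for position, lemmas in enumerate(analyzed_tokens):
        for lemma in lemmas:
            postings[lemma].append(position)
    return postings


# Hypothetical analysis of "הרכבת המוצר": the first token is ambiguous
# between "train" (רכבת) and "assembly" (הרכבה); the second is not.
analyzed = [["רכבת", "הרכבה"], ["מוצר"]]
postings = index_tokens(analyzed)
print(dict(postings))
```

Both readings of the first token land at position 0, so a query for either lemma matches, at the cost of some false positives that ranking has to push down.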

Other Text Retrieval Methods
Is morphological analysis necessary?
Available methods:
Light-stemming
Word truncation
N-grams
Skipgrams
(Sub-types)
Require no extra overhead
Favorable, even when not superior
Disadvantages: larger index size, slower searches (for some)
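Two of the language-independent methods listed above, character n-grams and skipgrams, are simple enough to sketch directly. The function names, the default n of 4 and the skip limit are choices made for this example, and definitions of skipgrams vary between papers.

```python
def char_ngrams(word, n=4):
    """Overlapping character n-grams; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def skip_bigrams(word, max_skip=1):
    """Character pairs with up to max_skip characters skipped between them."""
    grams = []
    for i in range(len(word)):
        for skip in range(max_skip + 1):
            j = i + 1 + skip
            if j < len(word):
                grams.append(word[i] + word[j])
    return grams


print(char_ngrams("search"))   # ['sear', 'earc', 'arch']
print(skip_bigrams("abcd"))    # ['ab', 'ac', 'bc', 'bd', 'cd']
```

Each word produces several index terms instead of one, which is where the larger index size noted on the slide comes from.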

… Applied to Semitic Languages
Research has shown 4-grams and light stemmers ("light-10") to work better than morphological lemmatizers for Arabic
Apparently, good relevance can be achieved without 'knowing' the language
Computers vs Humans
Lemmatization and disambiguation processes do make mistakes
Contextual processing can fail for short queries, producing incorrect searches

The Best Retrieval Method for Hebrew Texts
Arabic and Hebrew share many morphological phenomena
… but they do differ
Without trying, we can never know
Where HebMorph comes in

HebMorph's Approach

HebMorph
… is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.
2 goals
Development is done with Lucene (why?)
MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
OpenRelevance

Indexing Flow Chart

Searching Wikipedia with BzReader and HebMorph
Source available from http://github.com/synhershko/BzReader

The Road Ahead
A better tokenizer
MorphAnalyzer: Hspell improvements (coverage, lemma probabilities, prefix probabilities)
Toleration guidelines
Smarter OOV handling
Better stop words handling
Hebrew judgments for OpenRelevance with Orev
Comparing various approaches to Hebrew IR
Wide availability (Java port underway!)
Other uses (NLP, OCR, you name it)

Join Us!
The more people join, the more feedback we get, and the better we become.
Our mailing list: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
Code repository (released under GPLv2): http://github.com/synhershko/HebMorph
Activity updates: http://www.code972.com/blog/hebmorph/
#HebMorph on Twitter

Thank you!

