
Introduction to Apache Solr
Hassan Nasr Esfahani

Topics
– What we need from a text search engine
– What is Solr?
– Why Solr?
– Concepts And Architecture
– Usage
– Special Features
– Competitors

Text Retrieval vs Database Retrieval
– Information and Query
– Unstructured vs Structured
– Ambiguous vs Well defined
– Answers
– Relevant documents (ambiguous) vs matched documents

What we want from a text search engine
Basic Search Features:
– Store some documents with some fields
– Query for documents
Text Search Features
– Find most relevant docs
– Handle natural-language complications (stop words, stemming, tokenization, …)
– Highlight text
– …
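
The natural-language handling listed above (tokenizing, stop words, stemming) can be pictured as a small analysis chain. The following is only a toy sketch in Python, not how Solr or Lucene implement analysis; the stop-word list, the suffix rules, and the function names are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize on letters and digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


terms = analyze("The engine is indexing documents")
```

Running the same chain on both documents and queries is what lets a query for "index" match a document that only contains "indexing".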

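Problems with Text Search
Problem: Sample
– Tokenization: می‌روم، کتابش، محمد صادق ("I go", "his/her book", the two-word name "Mohammad Sadegh")
– Different letter representation: ی and ي (Persian and Arabic forms of the same letter)
– Similar words: میروم، میروی، میرود ("I go", "you go", "he/she goes")
– Synonymous words: معلم and آموزگار (both mean "teacher")
– Word ambiguity: شیر ("lion", "milk", or "tap")
– Stop words: با، است، به، رفت، ... ("with", "is", "to", "went", ...)
– Spelling errors: گذارش (misspelling of گزارش, "report")
– Spoken language: نون (colloquial form of نان, "bread")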

What is Solr?
– An open-source search engine
– Written in Java
– Built on top of Apache Lucene
– Exposes a REST-like HTTP API
– Fault tolerant
– Scalable
– Distributed

Solr Simple Architecture
[Diagram: Query and Documents inputs, each with an Analyze step and a Queue (steps 1, 2, 3), around Apache Lucene, with schema.xml as configuration]

Solr Features
– Advanced search methods
– Language knowledge
– Scoring/Boosting
– Grouping
– Highlighting
– Nested Documents
– Near-real-time index updates

How It Works
– A Solr server contains one or more cores (a core is similar to a database in a DBMS)
– Each core is specified by schema.xml + …
– Fields
– Data Types
– Analyzers

Field List
Field Attributes:
Type
Indexed
Stored
Multivalued
…
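
To make the field attributes above concrete, here is a hedged Python sketch that builds a field definition and wraps it in an "add-field" command of the kind accepted by Solr's Schema API. It assumes a core that uses a managed schema rather than a hand-edited schema.xml. The field names (`title`, `tags`) and the helper functions are hypothetical, and the type names `text_general` and `string` are assumed to exist in the core's configuration.

```python
import json


def field_definition(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Build a dict mirroring the field attributes listed above."""
    return {
        "name": name,
        "type": field_type,
        "indexed": indexed,        # searchable
        "stored": stored,          # returned in results
        "multiValued": multi_valued,
    }


def add_field_command(field):
    """Wrap a field definition in an "add-field" command as JSON text."""
    return json.dumps({"add-field": field})


# Hypothetical fields: a searchable, stored title and a multi-valued tag list
# that is indexed for search but not stored.
title = field_definition("title", "text_general")
tags = field_definition("tags", "string", stored=False, multi_valued=True)
payload = add_field_command(tags)
```

The resulting JSON text would be sent in a POST request to the core's schema endpoint; that request is left out here so the sketch runs without a server.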

Data Types
– Int, float, long, double
– Date
– String
– Text ( configurable )
– Location

Communicating with Solr
– REST API
– Client libraries
– Java
– Ruby
– PHP
– C#
– Python
– …
– Data Import Handlers
– Direct SQL query
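
As a rough illustration of the HTTP route, the sketch below builds a request URL for a core's `select` handler using only the Python standard library. The base URL, the core name `books`, and the helper names are assumptions; adjust them to your own installation. The `search` function needs a running Solr server, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of a local Solr server; not taken from the slides.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(core, query, rows=10):
    """Build the URL for a query against the core's select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


def search(core, query, rows=10):
    """Send the query and return the matching documents (needs a live server)."""
    with urlopen(select_url(core, query, rows)) as response:
        return json.load(response)["response"]["docs"]


url = select_url("books", "title:solr", rows=5)
```

The client libraries listed above wrap this same HTTP interface, so the URL is a useful thing to inspect when debugging them.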

Query Format
– Different query parsers:
– Standard (Lucene)
– DisMax
– eDisMax
– Block Join Query Parser
– …
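
The query parser is chosen per request with the `defType` parameter. The sketch below assembles parameters for an eDisMax query, where `qf` lists the fields to search along with their boosts. The field names `title` and `body`, the boost values, and the helper function are hypothetical.

```python
from urllib.parse import urlencode


def edismax_params(user_text, weighted_fields):
    """Build request parameters that route the query through eDisMax."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in weighted_fields.items())
    return {"q": user_text, "defType": "edismax", "qf": qf}


# Hypothetical setup: matches in the title count twice as much as in the body.
params = edismax_params("apache solr", {"title": 2, "body": 1})
query_string = urlencode(params)
```

With this setup the user can type plain words, and the parser spreads them over the configured fields instead of requiring `field:value` syntax.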

Standard Query Format
– field:Value
– Phrase search: field:"word list"
– Wildcard search: wor?d (single character), word* (any number of characters)
– Fuzzy searches: roam~ matches terms like foam or foams (maximum edit distance of 2)
– Proximity searches (words within a maximum distance): "jakarta apache"~10
– Range searches: [52 TO 10000] (inclusive bounds) or {Aida TO Carmen} (exclusive bounds)
– Boosting: jakarta^4 apache
– Boolean operators: AND (&&), OR (||), NOT (!), +, -, ( )
– Filter Query
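
The syntax above is plain text, so queries can be assembled with small string helpers. This is a minimal sketch; the helper names and the fields `title` and `price` are made up for illustration, and it does not escape every special character of the query syntax.

```python
def phrase(field, words):
    """field:"word list" with embedded double quotes escaped."""
    escaped = words.replace('"', '\\"')
    return f'{field}:"{escaped}"'


def fuzzy(word, max_edits=2):
    """word~N, matching terms within N edits of the word."""
    return f"{word}~{max_edits}"


def proximity(words, distance):
    """"word list"~N, matching the words within N positions of each other."""
    return f'"{words}"~{distance}'


def inclusive_range(field, low, high):
    """field:[low TO high], with both bounds included."""
    return f"{field}:[{low} TO {high}]"


def boost(clause, factor):
    """clause^factor, raising the weight of one clause."""
    return f"{clause}^{factor}"


# Combine clauses with a Boolean operator.
query = " AND ".join([
    phrase("title", "apache solr"),
    inclusive_range("price", 52, 10000),
])
```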

Relevancy
– ∝ Term Frequency
– ∝ Inverse Document Frequency
– Query Expansion
– …
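
The two proportionalities above can be illustrated with a toy scorer: a document's score grows with how often a query term appears in it (term frequency) and with how rare that term is across the collection (inverse document frequency). This is a simplified sketch, not the scoring formula Lucene actually uses; the sample documents and the function name are invented.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue
            idf = math.log(n_docs / doc_freq) + 1.0
            score += counts[term] * idf
        scores.append(score)
    return scores


docs = ["solr search engine", "search search database", "cooking recipes"]
scores = tf_idf_scores(["solr", "search"], docs)
```

Here the first document ranks highest because it contains the rarer term "solr", even though the second document repeats "search".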

Weaknesses
– No transactions
– No relational-style join queries
– Best used as a secondary data store, not as the primary database
– No partial record modification

Alternative
– Elasticsearch (also based on Apache Lucene)
– Geared mostly toward analytics use
– More popular
– Easier to start with
– Less documented
