Introduction to
Apache Solr
Hassan Nasr Esfahani
Topics
 What we need from a text search engine
 What is Solr?
 Why Solr?
 Concepts And Architecture
 Usage
 Special Features
 Competitors
Text Retrieval vs Database Retrieval
 Information and Query
 Unstructured vs Structured
 Ambiguous vs well-defined
 Answers
 Relevant documents (ambiguous) vs matched documents
What we want from a text search engine
Basic Search Features:
 Store some documents with some fields
 Query for documents
Text Search Features
 Find most relevant docs
 Handle natural-language complications (stop words, stemming, tokenization)
 Highlight text
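
The natural-language handling listed above can be sketched as a tiny analysis chain. This is a toy illustration only, not Solr's own analyzers: the stop-word list, the suffix rules, and the names `stem` and `analyze` are assumptions made up for this example.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("Searching the indexes of a search engine"))
# ['search', 'index', 'search', 'engine']
```

Because the same chain would run on both documents and queries, "Searching" and "search" end up as the same indexed term.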
Problems with Text Search
 Tokenization
 Different letter representation
 Similar words
 Synonymous words
 Word ambiguity
 Stop words
 Spelling errors
 Spoken language
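
The "different letter representation" problem can be illustrated with character normalization. A minimal sketch, assuming the issue is look-alike characters with different code points (for example Arabic Yeh U+064A vs Farsi Yeh U+06CC); the mapping table and the name `normalize_letters` are illustrative and are not Solr's built-in filters.

```python
import unicodedata

# Illustrative mapping: Arabic code points folded to their Persian look-alikes.
LETTER_MAP = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_letters(text):
    """Apply NFKC compatibility normalization, then fold look-alike letters."""
    return unicodedata.normalize("NFKC", text).translate(LETTER_MAP)


# Both spellings of the same two letters now compare equal.
print(normalize_letters("\u0643\u064a") == normalize_letters("\u06a9\u06cc"))
```

Applying the same normalization at index time and at query time is what makes the two spellings match.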
What is Solr?
 An open source search engine
 Written in Java
 Wrapping Apache Lucene
 With a REST API
 Fault tolerant
 Scalable
 Distributable
Solr Simple Architecture
[Diagram: Apache Lucene at the core; Query and Documents, each with an Analyze queue of steps 1, 2, 3; schema.xml]
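
Lucene's core data structure is an inverted index: a map from each term to the documents that contain it. A minimal sketch of indexing and querying, assuming lowercasing and whitespace splitting in place of a real analysis step; the class `TinyIndex` and its methods are hypothetical names for this example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # "Analyze" step reduced to lowercasing and whitespace splitting.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        result = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(result)


index = TinyIndex()
index.add(1, "Apache Solr wraps Apache Lucene")
index.add(2, "Lucene is a search library")
print(index.search("apache lucene"))  # [1]
print(index.search("lucene"))         # [1, 2]
```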
Solr Features
 Advanced Search Method
 Language knowledge
 Scoring/Boosting
 Grouping
 Highlighting
 Nested Documents
 Realtime index update
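
Highlighting, one of the features above, amounts to marking the matched terms in the returned text. A rough sketch of the idea, not Solr's highlighter: the function name `highlight` and the `<em>` tags are assumptions for illustration.

```python
import re


def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap whole-word, case-insensitive matches of any term in pre/post tags."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, lambda m: pre + m.group(1) + post, text, flags=re.IGNORECASE)


print(highlight("Solr wraps Apache Lucene", ["solr", "lucene"]))
# <em>Solr</em> wraps Apache <em>Lucene</em>
```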
How It Works
 A Solr server contains one or more cores (a core is similar to a database in a DBMS)
 Each core is specified by schema.xml +
 Fields
 Data Types
 Analyzers
Field List
Field Attributes:
Type
Indexed
Stored
Multivalued
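
The field attributes above map onto the `add-field` command of Solr's Schema API, which takes a JSON body. A sketch of building such a body, assuming a managed schema rather than a hand-edited schema.xml; the field name `title`, the type `text_general`, and the helper name are assumptions, and no request is actually sent here.

```python
import json


def add_field_command(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Build the JSON body for a Schema API add-field command."""
    return json.dumps({
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "multiValued": multi_valued,
        }
    })


# Body that would be POSTed to the core's /schema endpoint.
print(add_field_command("title", "text_general"))
```

An indexed field is searchable, a stored field can be returned in results, and a multivalued field holds a list of values per document.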
Data Types
 Int, float, long, double
 Date
 String
 Text ( configurable )
 Location
Communicating with Solr
 REST API
 Client libraries
 Java
 Ruby
 PHP
 C#
 Python
 
 Data Import Handlers
 Direct SQL query
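
Talking to Solr over HTTP comes down to building a request URL for a core's `/select` handler. A minimal sketch, assuming a local server on Solr's default port 8983 and a hypothetical core named `books`; the helper name `select_url` is also an assumption, and the request is only built, not sent.

```python
from urllib.parse import urlencode


def select_url(core, q, fq=None, rows=10, base="http://localhost:8983/solr"):
    """Build a URL for the /select handler of one core."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    if fq:
        # Filter query: restricts the result set without affecting scoring.
        params.append(("fq", fq))
    return f"{base}/{core}/select?{urlencode(params)}"


print(select_url("books", "title:solr", fq="year:[2010 TO *]"))
```

The client libraries listed above wrap this same HTTP interface, so the parameter names carry over.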
Query Format
 Different query parsers:
 Standard (Lucene)
 DisMax
 eDisMax
 Block Join Query Parser
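
The parser is chosen per request with the `defType` parameter. A sketch of the parameters for an eDisMax query that searches several fields with different weights via `qf`; the field names `title` and `body` and the helper name are assumptions for illustration.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts):
    """Build request parameters that route a query through eDisMax."""
    # qf lists the fields to search, each with an optional ^boost.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


params = edismax_params("apache solr", {"title": 2, "body": 1})
print(params["qf"])       # title^2 body^1
print(urlencode(params))  # query string to append to a /select URL
```

With DisMax-style parsers the user types plain words, and the field list lives in the request parameters rather than in the query text.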
Standard Query Format
 field:Value
 Phrase search: field:"word list"
 Wildcard search: wor?d, word*
 Fuzzy search: roam~ matches similar terms such as foam or foams (edit distance of at most 2)
 Proximity search (words within a maximum distance of each other): "jakarta apache"~10
 Range search: [52 TO 10000] (inclusive) or {Aida TO Carmen} (exclusive)
 Boosting: jakarta^4 apache
 Boolean operators: AND (&&), OR (||), NOT (!), +, -, ( )
 Filter Query
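
The syntax above can be assembled programmatically. A small sketch of helper functions that produce standard-parser query strings; the function names are hypothetical, and special characters in the inputs are not escaped here, which a real client would need to handle.

```python
def phrase(field, words):
    """field:"word list": match the words as an exact phrase."""
    return f'{field}:"{words}"'


def fuzzy(term, max_edits=2):
    """term~N: match terms within N edits (at most 2)."""
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    return f"{term}~{max_edits}"


def proximity(words, distance):
    """"w1 w2"~N: match the words within N positions of each other."""
    return f'"{words}"~{distance}'


def value_range(field, low, high, inclusive=True):
    """[low TO high] includes the endpoints, {low TO high} excludes them."""
    open_, close = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_}{low} TO {high}{close}"


def boost(term, factor):
    """term^N: weight this term N times as much when scoring."""
    return f"{term}^{factor}"


print(" AND ".join([phrase("title", "apache solr"), value_range("price", 52, 10000)]))
# title:"apache solr" AND price:[52 TO 10000]
```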
Relevancy
 Term Frequency
 Inverse Document Frequency
 Query Expansion
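
Term frequency and inverse document frequency combine into a relevance score. The sketch below uses the textbook TF-IDF formula, not Solr's exact similarity (which depends on version and configuration); the function name and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Score `term` for one tokenized document against a tokenized corpus."""
    if not doc:
        return 0.0
    # Term frequency: share of the document's tokens that are this term.
    tf = Counter(doc)[term] / len(doc)
    # Document frequency: number of documents containing the term.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus) / df)
    return tf * idf


corpus = [
    ["solr", "wraps", "lucene"],
    ["lucene", "is", "a", "library"],
    ["solr", "solr", "search"],
]
for term in ("solr", "search", "lucene"):
    print(term, round(tf_idf(term, corpus[2], corpus), 3))
```

In the third document "search" outscores "solr" even though it occurs less often, because it appears in fewer documents overall.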
Weaknesses
 No transactions
 No join queries
 Should be used as a secondary database, not the primary one
 No partial record modification
Alternative
 Elasticsearch, also built on Apache Lucene
 Geared mostly toward analytics
 More popular
 Easier to start with
 Less documented