Search Concepts and Tools
Steve Davids
What is Search?
Image Credit: https://www.flickr.com/photos/doggybytes/ - CC License
What is Search?
Definition: try to find something by looking or
otherwise seeking carefully and thoroughly
Synonyms: hunt, look, quest
Hunt
 SELECT * FROM book WHERE author LIKE '%name%'
 publisher:("Wall Street Journal" OR "WSJ.com" OR "Wall Street Journal India"...)
Look
 Relevant Results
 Use signals to boost
relevance
 Ability to quickly
whittle down results
Quest
Provide recommendations
Hunt Look Quest
What do we have in common?
 Open Source Search Library (Java API)
 Free text search
 Relevancy ranking
 Faceting and filtering
 Hit-term highlighting
 Near real-time indexing/querying
 Inverted Index
Free text search via
 Keyword
 Wildcard
 Proximity
 Fuzzy
 Range
 Geospatial
walk*
M?ham?d
M[ou]hamm?[ae]d
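A rough way to see what these patterns match, sketched in Python rather than with the Lucene API. It assumes the first two are glob-style wildcards (`*` for any run of characters, `?` for exactly one character) and reads the third as a regular expression, where `m?` means an optional `m`. The sample names are illustrative only.

```python
import fnmatch
import re

# Illustrative spellings, not taken from the slides.
names = ["Mohamed", "Muhamad", "Mohammed", "Muhammad"]

# Glob-style wildcards: '*' matches any run of characters, '?' exactly one.
prefix_hit = fnmatch.fnmatchcase("walking", "walk*")
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "M?ham?d")]

# Read as a regular expression: '[ou]' and '[ae]' are character classes
# and 'm?' is an optional 'm', so single and double 'm' spellings both match.
pattern = re.compile(r"M[ou]hamm?[ae]d")
regex_hits = [n for n in names if pattern.fullmatch(n)]

print(prefix_hit)
print(glob_hits)
print(regex_hits)
```

The single-character wildcard cannot absorb the doubled `m`, which is why the regular-expression form covers more spellings in this sketch.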
[* TO N]
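`[* TO N]` reads as an inclusive range with no lower bound. A minimal Python sketch of that check follows; the function name `in_range` and the value chosen for `N` are assumptions made for illustration.

```python
from typing import Optional


def in_range(value: int, lower: Optional[int] = None, upper: Optional[int] = None) -> bool:
    """Inclusive range check; None plays the role of '*' (no bound on that side)."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


N = 10  # hypothetical upper bound
matches = [v for v in [-5, 3, 10, 11] if in_range(v, upper=N)]
print(matches)
```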
Text Analysis
 Convert text into searchable words
 CharFilter
o Mutates single stream of text
 Tokenizer
o Splits single stream of text into one or more
tokens
 TokenFilter
o Mutates token stream
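The three stages can be sketched as plain function composition. This is a Python illustration of the idea, not the Lucene API; `analyze` and the tiny filters passed to it are hypothetical.

```python
from typing import Callable, List, Sequence

CharFilter = Callable[[str], str]                # text -> text
Tokenizer = Callable[[str], List[str]]           # text -> tokens
TokenFilter = Callable[[List[str]], List[str]]   # tokens -> tokens


def analyze(
    text: str,
    char_filters: Sequence[CharFilter],
    tokenizer: Tokenizer,
    token_filters: Sequence[TokenFilter],
) -> List[str]:
    """Run char filters on the raw text, tokenize once, then run token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "Hello World!",
    char_filters=[lambda s: s.replace("!", "")],           # mutate the text stream
    tokenizer=str.split,                                   # split on whitespace
    token_filters=[lambda ts: [t.lower() for t in ts]],    # mutate the token stream
)
print(tokens)
```

The same chain is normally applied at index time and at query time so that both sides produce comparable terms.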
Notable Character Filters
 HTML Strip
o <p>Example <a href=/test>link</a></p>
o Example link
 Pattern Replace
o pattern="[^a-zA-Z]" replacement=""
o Testing123
o Testing
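Both character filters above can be approximated with the Python standard library. This is a sketch of the behaviour, not Solr's implementation; `html_strip` and `pattern_replace` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_strip(text: str) -> str:
    """Drop markup and keep the text content."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.parts)


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of the regular expression with the replacement."""
    return re.sub(pattern, replacement, text)


print(html_strip('<p>Example <a href="/test">link</a></p>'))
print(pattern_replace("Testing123", r"[^a-zA-Z]", ""))
```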
Notable Tokenizers
 Keyword
o Hello World!
o {Hello World!}
 Whitespace
o Hello World!
o {Hello, World!}
 Standard
 Pattern
 ICU (International Components for Unicode)
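A small Python sketch of the first two tokenizers, plus a pattern tokenizer. The pattern version assumes the regular expression describes the delimiter between tokens. The function names are hypothetical, and the Standard and ICU tokenizers are not modeled here.

```python
import re
from typing import List


def keyword_tokenizer(text: str) -> List[str]:
    """Treat the whole input as a single token."""
    return [text]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace; punctuation stays attached to the token."""
    return text.split()


def pattern_tokenizer(text: str, delimiter: str) -> List[str]:
    """Split wherever the delimiter regex matches, dropping empty tokens."""
    return [token for token in re.split(delimiter, text) if token]


print(keyword_tokenizer("Hello World!"))
print(whitespace_tokenizer("Hello World!"))
print(pattern_tokenizer("a,b;c", r"[,;]"))
```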
Notable Token Filters
 Lower Case
o {Hello, World!}
o {hello, world!}
 Synonym
o synonyms.txt (expand=true): JPN, Japan, JN
   - {to, Japan}
   - {to, {Japan, JPN, JN}}
o synonyms.txt (expand=false)
   - {to, JPN}
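The two filters on this slide, sketched in Python. Each output position is a list of the terms that share that position, which is how the nested braces above can be read. `synonym_filter` is a hypothetical helper, not the Solr filter itself.

```python
from typing import Dict, List, Sequence


def lower_case(tokens: Sequence[str]) -> List[str]:
    """Lower-case every token."""
    return [token.lower() for token in tokens]


def synonym_filter(
    tokens: Sequence[str],
    groups: Sequence[Sequence[str]],
    expand: bool = True,
) -> List[List[str]]:
    """Return one list of terms per token position.

    expand=True  keeps the token and adds the rest of its synonym group.
    expand=False reduces any group member to the first term of the group.
    """
    lookup: Dict[str, Sequence[str]] = {}
    for group in groups:
        for term in group:
            lookup[term] = group

    positions = []
    for token in tokens:
        group = lookup.get(token)
        if group is None:
            positions.append([token])
        elif expand:
            positions.append([token] + [t for t in group if t != token])
        else:
            positions.append([group[0]])
    return positions


groups = [["JPN", "Japan", "JN"]]  # one line of synonyms.txt
print(lower_case(["Hello,", "World!"]))
print(synonym_filter(["to", "Japan"], groups, expand=True))
print(synonym_filter(["to", "Japan"], groups, expand=False))
```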
Notable Token Filters
 Word Delimiter
o {F22-Raptor}
o {F22, Raptor}
o {F, 22, Raptor}
o {F, {22, F22}, {Raptor, F22Raptor}}
 Porter Stem
o {walked, walking, walks}
o {walk, walk, walk}
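The first two Word Delimiter outputs above can be reproduced with a short Python sketch. It only splits on non-alphanumeric characters and, optionally, on letter/digit boundaries. The real filter has more options (case changes, catenation of parts) that are not modeled, and `word_delimiter` is a hypothetical name.

```python
import re
from typing import List


def word_delimiter(token: str, split_on_numerics: bool = True) -> List[str]:
    """Split a token on delimiters and, optionally, on letter/digit transitions."""
    # First pass: break on anything that is not a letter or digit.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if not split_on_numerics:
        return parts
    # Second pass: separate runs of letters from runs of digits.
    subwords: List[str] = []
    for part in parts:
        subwords.extend(re.findall(r"[A-Za-z]+|[0-9]+", part))
    return subwords


print(word_delimiter("F22-Raptor", split_on_numerics=False))
print(word_delimiter("F22-Raptor"))
```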
Inverted Index
T[0] = "It is what it is"
T[1] = "what is it?"
T[2] = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 2}
"It": {0}
"it?": {1}
"what": {0, 1}
Inverted Index
T[0] = "It is what it is"
T[1] = "what is it?"
T[2] = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
What is Relevant?
 TF-IDF
o Term Frequency - Inverse Document Frequency
 Boosting
o Important terms
o Signals
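A textbook TF-IDF variant, sketched in Python with an optional boost multiplier. This is not Lucene's exact scoring formula. The function names, the tokenizer, and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(term: str, doc: Sequence[str], corpus: Sequence[Sequence[str]], boost: float = 1.0) -> float:
    """boost * tf * ln(N / df): frequent in this document, rare across the corpus."""
    tf = Counter(doc)[term]
    df = sum(1 for tokens in corpus if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return boost * tf * math.log(len(corpus) / df)


corpus = [tokenize(t) for t in ["It is what it is", "what is it?", "it is a banana"]]
print(tf_idf("is", corpus[0], corpus))       # appears in every document
print(tf_idf("banana", corpus[2], corpus))   # appears in one document only
```

A term that occurs in every document contributes nothing under this variant, while a rare term dominates the score; the boost simply scales whichever terms or signals are considered important.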
Relational vs Denormalized
Students
- S01 | Marina
- S02 | Ben
Classes
- C01 | Hadoop
- C02 | Solr
Enrolled
- S01 | C01
- S01 | C02
- S02 | C01
- S02 | C02
Students
- S01 | Marina | [Hadoop, Solr]
- S02 | Ben | [Hadoop, Solr]
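The step from the three relational tables to one document per student can be sketched in Python as a join performed once, before indexing. The field names `id`, `name` and `classes` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

students = {"S01": "Marina", "S02": "Ben"}
classes = {"C01": "Hadoop", "C02": "Solr"}
enrolled = [("S01", "C01"), ("S01", "C02"), ("S02", "C01"), ("S02", "C02")]


def denormalize(
    students: Dict[str, str],
    classes: Dict[str, str],
    enrolled: Sequence[Tuple[str, str]],
) -> List[dict]:
    """Fold the join table into a multivalued 'classes' field on each student."""
    docs = {
        student_id: {"id": student_id, "name": name, "classes": []}
        for student_id, name in students.items()
    }
    for student_id, class_id in enrolled:
        docs[student_id]["classes"].append(classes[class_id])
    return list(docs.values())


for doc in denormalize(students, classes, enrolled):
    print(doc)
```

The trade-off is duplication: a class name is stored on every enrolled student, so renaming a class means reindexing those documents.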
Wrap it up...
 Design for the user
 Know your data
 Be lazy
Questions?
Solr Hack Session...
Solr Schema
 Field Definitions
o Field Type, Indexed, Stored,
Multivalued, Doc Values
o Copy Fields
o Dynamic Fields
   - <dynamicField name="*_sort" type="lowercase" />
 Field Types
o Analysis Chain
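How a dynamic field pattern such as `*_sort` picks up field names can be sketched with glob matching in Python. This assumes explicitly defined fields are checked before dynamic patterns; the `id` field and its `string` type are hypothetical, and `resolve_field_type` is not a Solr API.

```python
import fnmatch
from typing import Dict, Optional, Sequence, Tuple


def resolve_field_type(
    name: str,
    explicit_fields: Dict[str, str],
    dynamic_fields: Sequence[Tuple[str, str]],
) -> Optional[str]:
    """Return the field type for a field name, or None if nothing matches."""
    if name in explicit_fields:
        return explicit_fields[name]
    for pattern, field_type in dynamic_fields:
        if fnmatch.fnmatchcase(name, pattern):
            return field_type
    return None


explicit_fields = {"id": "string"}           # hypothetical explicit field
dynamic_fields = [("*_sort", "lowercase")]   # pattern and type from the slide

print(resolve_field_type("title_sort", explicit_fields, dynamic_fields))
print(resolve_field_type("id", explicit_fields, dynamic_fields))
print(resolve_field_type("title", explicit_fields, dynamic_fields))
```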
Solr Config
 Request handler definitions
 Search component definitions
 Update processor chains
 Cache settings
 Index specifications
 Threshold settings
 Custom library import locations
