Why Search?
(starring Elasticsearch)
Doug Turnbull
OpenSource Connections

Hello
- Me
  o @softwaredoug
  o dturnbull@o19s.com
- Us
  o http://o19s.com
  o World-class search consultants
  o Right here in Cville!
  o Hiring passionate interns!

Why Search?
- What does a dedicated search engine do
  o that a database doesn't?

- Why not [MySQL | MongoDB | Cassandra | etc.]?
- Why a dedicated search engine?

Why not MySQL?
- We've got rows of stuff in tables. E.g., for SciFi
  StackExchange, we've stored ~20K posts:

PostID  UserId  CreationDate             ViewCount  Body
0       1       2011-01-11T20:52:46.753  124        <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1       2       2013-02-01T12:44:46.525  525        <p>Been meaning to read the Foundation Series, what should I read first?</p>

Why not MySQL?
- Our mission: find all the "Darth Vader" in SciFi
  StackExchange posts!

P  U  C  V  Body
0  1  2  1  <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>   <- Found!
1  2  2  5  <p>Been meaning to read the Foundation Series, what should I read first?</p>   <- Missing!

Why not MySQL: SQL LIKE?
- The SQL LIKE operator scans all rows for a specific
  wildcard match:
SELECT * FROM posts WHERE body LIKE "%darth vader%"
- This performs a table scan, asking "Match?" of every row in turn.
- Approx. 300 ms to search a measly 20K docs!
  (What if we had 20 million?)
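The table scan can be observed on a small scale. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the two-row `posts` table (with a hypothetical `post_id` column) is rebuilt in memory from the example rows above. It runs the LIKE query and asks SQLite how it plans to execute it.

```python
import sqlite3

# Hypothetical stand-in for the `posts` table from the slides, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, what should I "
            "read first?</p>"),
    ],
)

query = "SELECT post_id FROM posts WHERE body LIKE '%darth vader%'"

# LIKE is case-insensitive for ASCII text in SQLite by default.
matches = [row[0] for row in conn.execute(query)]

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(matches)
print(plan)
```

Only post 0 matches, and the plan reports a scan of `posts` (the exact wording varies by SQLite version). A leading wildcard leaves the engine no way to narrow the search, so every row is checked.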

SQL LIKE: other problems
- Can't search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
  0 results
- Can't search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
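Both misses are easy to reproduce. This is a minimal sketch, again assuming sqlite3 as a stand-in for MySQL. The two rows are made up for illustration, and `like_count` is a hypothetical helper, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body TEXT)")
# Made-up rows for illustration; they are not from the StackExchange data.
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("Darth Vader dined with Luke",), ("Look at these kitty pictures",)],
)

def like_count(phrase):
    """Count rows whose body contains `phrase` as a literal substring."""
    row = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE body LIKE ?", (f"%{phrase}%",)
    ).fetchone()
    return row[0]

print(like_count("darth vader"))      # same words, same order
print(like_count("vader, darth"))     # same words, different order
print(like_count("kitty pictures"))   # spelling stored in the table
print(like_count("kittie pictures"))  # alternate spelling
```

The first and third lookups find one row each. The second and fourth find nothing, because LIKE only tests for a literal substring.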

SQL LIKE: other problems
- No ranking of results. Given these two docs:

I seem to remember a novel, I think it was Dark Lord: The
Rise of Darth Vader, that addressed this. It made the
assertion that while Darth Vader had lost both hands, he
was still as formidable, in the force sense,
  - Directly about Darth Vader

One might ask how none of the Jedi at Qui-Gon's funeral noticed that
there was a Dark Lord of the Sith standing right behind them. Darth
Vader and Obi-Wan only noticed each other when on the same station...
It's apparently hard to pick up another force-user without knowing
he or she is there
  - Darth Vader is a side topic here

Which should come first?

SQL LIKE | CTRL+F | grep is
1. Extremely slow

2. Not fuzzy -- needs exact literal matches, no
fuzziness!

3. Unranked -- simply says y/n whether there is a
match

Search needs to be
1. FAST! A data structure that can efficiently take
search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy
modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.

Let's play with an implementation
- Your database's full-text search features
  o MySQL, for example, has a FULLTEXT index
  o Works for trivial cases, not the path of wisdom

- Lucene -> Elasticsearch
  o Lucene, 1999, by Doug Cutting
    - Java library for search
  o Solr, 2006, Yonik Seeley
    - First to put Lucene behind an HTTP interface
    - Still going strong
  o Elasticsearch, 2010, Shay Banon
    - Alternative implementation
    - Extremely REST-y

Elasticsearch
- Create an index:
curl -XPUT 'http://localhost:9200/stackexchange'
- Index some docs!
curl -XPUT 'http://localhost:9200/stackexchange/post/1' -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."}'

What is being built?
The answer can be found in your textbook.
Book index:
- Topics -> page number
- Very efficient tool: compare it to
  scanning the whole book!

Lucene uses an index:
- Tokens => document IDs:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
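To make the token-to-document mapping concrete, here is a minimal sketch of an inverted index in Python. It is not how Lucene stores its index. `build_inverted_index` is a hypothetical helper and the three documents are made up, but the shape of the result matches the token => document IDs listing above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each whitespace-separated, lowercased token to sorted doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Made-up documents, chosen only to produce small posting lists.
docs = {
    0: "lightsaber duel",
    1: "red lightsaber",
    2: "laser light show",
}
index = build_inverted_index(docs)
print(index["lightsaber"])
print(index.get("laser", []))
```

Looking up a term is a single dictionary access, no matter how many documents were indexed. That is the property the book-index analogy points at.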

Computers == Dumb
- Humans are smart
  o I see "cat" or "cats" in the back of a book, no duh: jump
    to page 9

- Computers are dumb
  o CAT != cat: no match returned
  o cat != cats: no match returned

- Hence, when indexing, normalize text to a more
  searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
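A toy normalizer shows the idea. This sketch is an assumption-laden simplification and is not the Snowball or Porter algorithm. The hypothetical `normalize` function lowercases a word and strips two common English suffixes, which is just enough to reproduce the three mappings above.

```python
def normalize(word):
    """Toy normalizer: lowercase, then strip a few English suffixes."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo consonant doubling, e.g. "fitt" -> "fit".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

for raw in ["CAT", "cat", "cats", "fitted", "alumnus"]:
    print(raw, "->", normalize(raw))
```

Here CAT, cat and cats all normalize to "cat", so a lookup for any of them hits the same index entry. The odd-looking "alumnu" is harmless as long as the same normalization is applied to both documents and queries.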

Normalization aka Text Analysis
- Raw input -> filtered (char filter)
  o <p>Darth Vader dined with Luke</p>
  o Darth Vader dined with Luke

- Tokenized
  o Darth Vader dined with Luke
  o [Darth] [Vader] [dined] [with] [Luke]

- Token filters (lowercased, synonyms applied,
  pointless words removed)
  o [darth] [vader] [dine] [luke]

- Most importantly: this is highly configurable
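The three stages can be sketched as three small Python functions. This only illustrates the pipeline's shape and is not Elasticsearch's implementation. The stopword list and the one-entry `STEMS` table are stand-ins for real token filters, and all function names are hypothetical.

```python
import re

STOPWORDS = {"with", "the", "a", "and"}  # tiny illustrative list
STEMS = {"dined": "dine"}                # stand-in for a real stemmer

def char_filter(raw):
    """Strip HTML tags from the raw input."""
    return re.sub(r"<[^>]+>", "", raw)

def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Lowercase, drop stopwords, and apply the stand-in stemmer."""
    lowered = (t.lower() for t in tokens)
    kept = (t for t in lowered if t not in STOPWORDS)
    return [STEMS.get(t, t) for t in kept]

def analyze(raw):
    """Run the char filter, tokenizer, and token filters in order."""
    return token_filters(tokenize(char_filter(raw)))

print(analyze("<p>Darth Vader dined with Luke</p>"))
```

This prints ['darth', 'vader', 'dine', 'luke'], matching the token list above. Swapping any one stage changes what ends up in the index, which is what "highly configurable" means in practice.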

Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{
"tokens": [
{
"end_offset": 5,
"position": 1,
"start_offset": 0,
"token": "darth",
"type": "<ALPHANUM>"
},
{
"end_offset": 11,
"position": 2,
"start_offset": 6,
"token": "vader",
"type": "<ALPHANUM>"
},
{
"end_offset": 17,
"position": 3,
"start_offset": 12,
"token": "dine",
"type": "<ALPHANUM>"
},
{
"end_offset": 27,
"position": 5,
"start_offset": 23,
"token": "luke",
"type": "<ALPHANUM>"
}
]
}

What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT 'http://localhost:9200/stackexchange/post/1' -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."}'
curl -XPUT 'http://localhost:9200/stackexchange/post/2' -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."}'

Ranking
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

Can we store anything here (in <metadata>) to
help decide how relevant this
term is for this doc?

Yes!
- Term frequency
  - How much "darth" is in
    this doc?
- Position within document
  - Helps when we search for
    the phrase "darth vader"
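Both kinds of metadata fit naturally into the postings. The sketch below is a simplification under stated assumptions: it uses plain lowercasing and whitespace splitting instead of real analysis, the tag-free text of the two example documents, and hypothetical helper names. Each posting holds a list of positions, so the term frequency is just the length of that list.

```python
from collections import defaultdict

def build_postings(docs):
    """Return {term: {doc_id: [positions]}} for lowercased word tokens."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            postings[token].setdefault(doc_id, []).append(position)
    return postings

def term_frequency(postings, term, doc_id):
    """How many times `term` occurs in the document."""
    return len(postings.get(term, {}).get(doc_id, []))

def phrase_match(postings, first, second, doc_id):
    """True if `second` occurs directly after `first` in the document."""
    first_pos = postings.get(first, {}).get(doc_id, [])
    second_pos = set(postings.get(second, {}).get(doc_id, []))
    return any(p + 1 in second_pos for p in first_pos)

docs = {1: "Darth Vader dined with Luke", 2: "We love Darth"}
postings = build_postings(docs)

print(postings["darth"])
print(term_frequency(postings, "darth", 1))
print(phrase_match(postings, "darth", "vader", 1))
print(phrase_match(postings, "darth", "vader", 2))
```

Document 2 contains "darth" but fails the phrase check, because no "vader" posting sits at the next position.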

Query Documents
- When did Darth Vader and Luke have dinner?
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'
("luke darth dinner" is the user query)

What happens when we query?
1. User query: luke darth dinner
2. Analysis: [luke] [darth] [dine]
3. How to consult index for matches?
   o [darth] -> score for [darth] docs (1 and 2)
   o [dine] -> score for [dine] docs (1)
4. Return sorted docs to client

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
  ...
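The lookup-and-score loop can be sketched in a few lines of Python. It starts from the already analyzed query terms and uses only the three index terms shown above. Scoring by the number of matched terms is an assumption made for brevity and is not Lucene's scoring formula. `INDEX` and `search` are hypothetical names.

```python
from collections import Counter

# The postings shown above, reduced to term -> doc ids.
INDEX = {"darth": [1, 2], "vader": [1], "dine": [1]}

def search(analyzed_terms, index):
    """Score each doc by how many query terms it matches, best first."""
    scores = Counter()
    for term in analyzed_terms:
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Sort by descending score, then by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

results = search(["luke", "darth", "dine"], INDEX)
print(results)
```

Document 1 matches two of the query terms and document 2 matches one, so document 1 is returned first. Terms with no postings in this toy index, such as "luke", simply contribute nothing.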

So Elasticsearch!
- FAST!
  o Inverted index data structure is blazing fast
  o Lucene is probably the most tuned implementation

- FUZZY!
  o We use analysis to normalize text to canonical forms
  o We can use positional information when querying (not
    shown here)

- FRUITFUL!
  o Relevant documents are scored based on relative term
    frequency

BUT WAIT, THERE'S MORE
- Many non-traditional applications of search
  o Rank file directory by proximity to current directory
  o Geographic-aided search: rank based on distance and
    search relevancy
  o Q&A systems: Watson has a ton of Lucene
  o Log aggregation, e.g. Kibana -- because in Lucene
    everything is indexed!

- And many features!
  o Spellchecking
  o Facets
  o More-like-this document

QUESTIONS?
