
Why Search?
(starring Elasticsearch)
Doug Turnbull
OpenSource Connections


Hello
• Me
@softwaredoug
dturnbull@o19s.com
• Us
http://o19s.com
World class search consultants
Right here in C’ville!
Hiring passionate interns!

Why Search?
• What does a dedicated search engine do?
o that a database doesn’t?

• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?


Why not MySQL?
• We’ve got rows of stuff in tables. E.g., for SciFi
StackExchange, we’ve stored ~20K posts:
PostID | UserId | CreationDate            | ViewCount | Body
0      | 1      | 2011-01-11T20:52:46.753 | 124       | What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?
1      | 2      | 2013-02-01T12:44:46.525 | 525       | Been meaning to read the Foundation Series, what should I read first?
Why not MySQL?
• Our mission: Find all the “Darth Vader” in SciFi
StackExchange Posts!
P | U | C | V | Body
0 | 1 | 2 | 1 | What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?   <- Found!
1 | 2 | 2 | 5 | Been meaning to read the Foundation Series, what should I read first?   <- Missing!

Why not MySQL – SQL Like?
• SQL “LIKE” operator – scans all rows for a specific
wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Performs a table scan: every row is tested for a match
• Approx 300ms to search a measly 20K docs!
(what if we had 20 million?)
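To make the table scan concrete, here is a minimal Python sketch of what a LIKE-style search has to do. The `posts` list and the `like_scan` helper are hypothetical stand-ins for the real table and the database engine; only the two post bodies come from the slides.

```python
# Toy stand-in for the posts table (bodies taken from the slides).
posts = [
    {"PostID": 0, "Body": "What exactly did Obiwan know about Anakin and "
                          "Darth Vader before a New Hope started?"},
    {"PostID": 1, "Body": "Been meaning to read the Foundation Series, "
                          "what should I read first?"},
]


def like_scan(rows, needle):
    """Mimic LIKE '%needle%': test every row, case-insensitively."""
    needle = needle.lower()
    matches = []
    rows_examined = 0
    for row in rows:
        rows_examined += 1  # every row is touched, match or not
        if needle in row["Body"].lower():
            matches.append(row["PostID"])
    return matches, rows_examined


matches, examined = like_scan(posts, "darth vader")
print(matches, examined)
```

The number of rows examined always equals the table size, so the cost grows linearly with the number of posts.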

SQL Like – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:

SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
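These failures are easy to reproduce. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example runs without a server; LIKE behaves the same way for this purpose). The "kitty pictures" row is hypothetical and added only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, body) VALUES (?, ?)",
    [
        (0, "What exactly did Obiwan know about Anakin and "
            "Darth Vader before a New Hope started?"),
        (1, "Been meaning to read the Foundation Series, "
            "what should I read first?"),
        (2, "Here are some kitty pictures"),  # hypothetical row
    ],
)


def like_search(pattern):
    """Return the ids of posts whose body matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


in_order = like_search("%darth vader%")       # literal substring is present
out_of_order = like_search("%vader, darth%")  # same words, different order
alternate = like_search("%kittie pictures%")  # different spelling of "kitty"
print(in_order, out_of_order, alternate)
```

Only the exact substring matches; reordering the words or changing the spelling returns nothing.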


SQL Like – other problems
• No Ranking of Results – given these two docs:

I seem to remember a novel, I
think it was Dark Lord: The
Rise of Darth Vader, that
addressed this. It made the
assertion that while Darth
Vader had lost both hands, he
was still as formidable, in the
force sense,

- Directly about Darth Vader

One might ask how none of the Jedi
at Qui-Gon's funeral noticed that
there was a Dark Lord of the Sith
standing right behind them. Darth
Vader and Obi-Wan only noticed
each other when on the same station
… It's apparently hard to pick up
another force-user without knowing
he or she is there…

- Darth Vader is a side topic here

Which should come first?

SQL LIKE | CTRL+F | grep is
1. Extremely Slow

2. Not fuzzy -- Needs exact literal matches, no
fuzziness!

3. Unranked -- Simply says y/n whether there is a
match


Search needs to be
1. FAST! A data structure that can efficiently take
search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy
modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.


Let’s play with an implementation
• Your database’s full text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom

• Lucene -> Elasticsearch
o Lucene is the library underneath both Solr and Elasticsearch

• Lucene, 1999, by Doug Cutting
o Java library for search
• Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong
• Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y

Elasticsearch
• Create an index

curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "Darth Vader dined with Luke",
"Title": "..."}'

What is being built?
The answer can be found in your textbook…
Book Index:
• Topics -> page no
• Very efficient tool – compare to
scanning the whole book!

Lucene uses an index:
• Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]


Computers == Dumb
• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh – jump
to page 9

• Computers are dumb,
o “CAT” != “cat” – no match returned
o “cat” != “cats” – no match returned

• Hence, when indexing, normalize text to more
searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu


Normalization aka Text Analysis
• Raw input
o Darth Vader dined with Luke

• Filtered (char filter)
o Darth Vader dined with Luke

• Tokenized
o [Darth] [Vader] [dined] [with] [Luke]

• Token filters (lowercased, stemmed, synonyms applied,
stop words removed)
o [darth] [vader] [dine] [luke]

• Most importantly: this is highly configurable
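The chain above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Elasticsearch's analyzer: the stop word list is illustrative, the char filter only strips HTML-like tags, and `toy_stem` is a crude hand-written suffix stripper rather than the Snowball algorithm.

```python
import re

STOP_WORDS = {"with", "the", "a", "an", "and", "of"}  # illustrative list
VOWELS = set("aeiou")


def char_filter(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Crude suffix stripping; NOT the Snowball/Porter algorithm."""
    if token.endswith("ed") and len(token) > 4:
        stem = token[:-2]
        if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            return stem[:-1]  # fitt -> fit
        if (len(stem) == 3 and stem[0] not in VOWELS
                and stem[1] in VOWELS and stem[2] not in VOWELS):
            return stem + "e"  # din -> dine
        return stem
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]  # cats -> cat, alumnus -> alumnu
    return token


def analyze(text):
    """Char filter -> tokenizer -> lowercase, stop words, stemming."""
    tokens = [t.lower() for t in tokenize(char_filter(text))]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


print(analyze("Darth Vader dined with Luke"))
```

Each stage is a separate function, which mirrors why the real pipeline is so configurable: stages can be swapped or reordered independently.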

Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{
  "tokens": [
    {
      "end_offset": 5,
      "position": 1,
      "start_offset": 0,
      "token": "darth",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 11,
      "position": 2,
      "start_offset": 6,
      "token": "vader",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 17,
      "position": 3,
      "start_offset": 12,
      "token": "dine",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 27,
      "position": 5,
      "start_offset": 23,
      "token": "luke",
      "type": "<ALPHANUM>"
    }
  ]
}
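The offsets and positions in that response can be reproduced by hand. The sketch below is a hypothetical Python approximation, not the Snowball analyzer: it lowercases and drops the stop word "with" but does not stem, so the third token comes out as "dined" rather than "dine". Note how removing "with" leaves a gap at position 4.

```python
import re

STOP_WORDS = {"with"}  # illustrative; real analyzers ship longer lists


def tokens_with_offsets(text):
    """Yield token dicts with character offsets and 1-based positions."""
    out = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        token = match.group().lower()
        if token in STOP_WORDS:
            continue  # the position is consumed, leaving a gap
        out.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return out


tokens = tokens_with_offsets("Darth Vader dined with Luke")
for tok in tokens:
    print(tok)
```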

What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

Indexed documents:
doc 1: curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{"Body": "Darth Vader dined with Luke", "Title": "..."}'
doc 2: "Body": "We love Darth", "Title": "..."

Ranking
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

doc 2: "Body": "We love Darth", "Title": "..."

Can we store anything here to
help decide how relevant this
term is for this doc?

Yes!
- Term Frequency
- How much “darth” is in
this doc?
- Position within document
- Helps when we search for
the phrase “darth vader”
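A minimal Python sketch of what that per-document metadata might look like: each posting stores a term frequency and the positions where the term occurs. The layout and the helper names (`build_postings`, `phrase_match`) are hypothetical and far simpler than Lucene's on-disk format; the token lists are assumed to be already analyzed.

```python
from collections import defaultdict

# Already-analyzed tokens per document (assumed output of text analysis).
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}


def build_postings(docs):
    """term -> {doc_id: {"tf": count, "positions": [token positions]}}"""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            entry = postings[term].setdefault(
                doc_id, {"tf": 0, "positions": []}
            )
            entry["tf"] += 1
            entry["positions"].append(position)
    return postings


def phrase_match(postings, first, second, doc_id):
    """True if `second` occurs immediately after `first` in the document."""
    a = postings.get(first, {}).get(doc_id)
    b = postings.get(second, {}).get(doc_id)
    if not a or not b:
        return False
    return any(p + 1 in b["positions"] for p in a["positions"])


postings = build_postings(docs)
print(postings["darth"])
print(phrase_match(postings, "darth", "vader", 1))
print(phrase_match(postings, "darth", "vader", 2))
```

Term frequency feeds ranking, while the stored positions are what make a phrase query like "darth vader" answerable without rereading the document.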

Query Documents
• When did Darth Vader and Luke have dinner?
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'
User query: luke darth dinner
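For readers who prefer a script to curl, the same request can be assembled with Python's standard library. This is a sketch only: `build_search_request` is a hypothetical helper, the host and index name are copied from the curl example, and the Content-Type header is an assumption that newer Elasticsearch versions expect. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_search_request(user_query,
                         host="http://localhost:9200",
                         index="stackexchange"):
    """Build (but do not send) a match query against the Body field."""
    body = {"query": {"match": {"Body": user_query}}}
    return urllib.request.Request(
        url=f"{host}/{index}/_search?pretty=true",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("luke darth dinner")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (needs a running Elasticsearch node):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```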


What happens when we query?
1. User query: luke darth dinner
2. Analysis: [luke] [darth] [dine]
3. How to consult index for matches? Look up each term:
o [darth]: score for [darth] docs (1 and 2)
o [dine]: score for [dine] docs (1)
4. Return sorted docs to client

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

...
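The query flow can be sketched end to end in Python. Everything here is a simplification under stated assumptions: the index holds only the three terms from the diagram, `analyze_query` maps "dinner" to "dine" by hand to follow the slide, and the score is a plain sum of term frequencies rather than Lucene's actual scoring formula.

```python
# Postings as in the diagram: term -> {doc id: term frequency}.
index = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
}


def analyze_query(text):
    """Stand-in for real analysis: lowercase, then map 'dinner' to 'dine'
    by hand so the query term lines up with the indexed term."""
    hand_stems = {"dinner": "dine"}
    return [hand_stems.get(t, t) for t in text.lower().split()]


def search(index, query):
    """Sum term frequencies per doc, then sort by score (highest first)."""
    scores = {}
    for term in analyze_query(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search(index, "luke darth dinner")
print(results)
```

Doc 1 matches both [darth] and [dine], so it outranks doc 2, which matches only [darth]; [luke] finds no postings in this truncated index and contributes nothing.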

So Elasticsearch!
• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation

• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not
shown here)

• FRUITFUL!
o Relevant documents are scored based on relative term
frequency


BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
o Rank file directory by proximity to current directory
o Geographic-aided search, rank based on distance and
search relevancy
o Q & A systems – Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene
everything is indexed!

• And many features!
o Spellchecking
o Facets
o More-like-this document


QUESTIONS?

