The document discusses why a dedicated search engine like Elasticsearch is better than a traditional database for search tasks. It explains that databases are optimized for data storage and retrieval by unique IDs, but are slow and inefficient for full text search. Elasticsearch uses an inverted index which allows it to quickly search text fields and return relevant results. It analyzes, normalizes, and indexes documents upfront so queries can be executed rapidly against the index. Ranking algorithms ensure the most relevant documents are prioritized in results.
3. Why Search?
What does a dedicated search engine do that a database doesn't?
Why not [MySQL | MongoDB | Cassandra | etc.]?
Why a dedicated search engine?
OpenSource Connections
4. Why not MySQL?
We've got rows of stuff in tables. E.g., for SciFi StackExchange, we've stored ~20K posts:
PostID | UserId | CreationDate | ViewCount | Body
0 | 1 | 2011-01-11T20:52:46.75 | 3 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1 | 2 | 2013-02-01T12:44:46.52 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p>
5. Why not MySQL?
Our mission: Find all the Darth Vader in SciFi StackExchange Posts!
P | U | C | V | Body
0 | 1 | 2 | 1 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p> -> Found!
1 | 2 | 2 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p> -> Missing!
6. Why not MySQL: SQL LIKE?
The SQL LIKE operator scans all rows for a specific wildcard match:
SELECT * FROM posts WHERE body LIKE "%darth vader%"
Performs a table scan, testing every row in turn: Match? Match? Match? Match?
Approx 300ms to search a measly 20K docs!
(what if we had 20 Million?)
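A minimal sketch of the table scan, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for illustration; the two-column table and sample rows are cut down from the posts above). With a leading wildcard the pattern cannot be answered from an ordinary index, so the query planner reports a scan over the whole table:

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL posts table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)

query = "SELECT post_id FROM posts WHERE body LIKE ?"
pattern = "%darth vader%"

# SQLite's LIKE is case-insensitive for ASCII text by default.
matching_ids = [row[0] for row in conn.execute(query, (pattern,))]

# Ask the planner how it would run the query; the last column of each
# plan row is a human-readable description of the step.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query, (pattern,)).fetchall()
plan_details = [row[3] for row in plan_rows]

print("matching post ids:", matching_ids)
print("query plan:", plan_details)
```

The plan description contains the word SCAN: every row is visited, so the cost grows with the size of the table rather than with the number of matches.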
7. SQL LIKE: other problems
Can't search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
Can't search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
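The word-order problem can be sketched in a few lines of Python. Here like_contains is a hypothetical helper that roughly mimics a case-insensitive LIKE "%...%" as a plain substring test, and all_words_present is an equally hypothetical word-by-word check, shown only to contrast the two behaviours:

```python
import re

def like_contains(body, phrase):
    """Roughly mimic LIKE '%phrase%': a case-insensitive substring test."""
    return phrase.lower() in body.lower()

def all_words_present(body, phrase):
    """Match when every query word occurs somewhere in the body, in any order."""
    body_words = set(re.findall(r"\w+", body.lower()))
    query_words = re.findall(r"\w+", phrase.lower())
    return all(word in body_words for word in query_words)

body = ("What exactly did Obiwan know about Anakin and Darth Vader "
        "before a New Hope started?")

print(like_contains(body, "darth vader"))        # substring present
print(like_contains(body, "vader, darth"))       # same words, wrong order
print(all_words_present(body, "vader, darth"))   # order no longer matters
```

The substring test only succeeds when the words appear in exactly the order typed; matching on individual words is the first step towards what a search engine does.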
8. SQL LIKE: other problems
No ranking of results. Given these two docs:
I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
- Directly about Darth Vader
One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station. It's apparently hard to pick up another force-user without knowing he or she is there.
- Darth Vader is a side topic here
Which should come first?
9. SQL LIKE | CTRL+F | grep is
1. Extremely slow
2. Not fuzzy -- needs exact literal matches, no fuzziness!
3. Unranked -- simply says y/n whether there is a match
10. Search needs to be
1. FAST! A data structure that can efficiently take search terms and return a set of documents
2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching
3. FRUITFUL! Relevant documents bubble to the top.
11. Let's play with an implementation
Your database's full-text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom
Lucene -> Solr -> Elasticsearch
Lucene, 1999, by Doug Cutting
o Java library for search
Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong
Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y
12. Elasticsearch
Create an index
curl -XPUT http://localhost:9200/stackexchange
Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
13. What is being built?
The answer can be found in your textbook
Book index:
Topics -> page no.
A very efficient tool compared to scanning the whole book!
Lucene uses an inverted index:
Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
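A minimal Python sketch of the same structure, to make the book-index analogy concrete. The eight document texts are made up for illustration, chosen so that the resulting postings match the lists above; Lucene's real on-disk format is far more elaborate than a dictionary.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a token repeated within one document is recorded once.
        for token in set(text.lower().split()):
            index[token].append(doc_id)
    return dict(index)

# Hypothetical documents; the list position is the document id.
docs = [
    "lightsaber duel",     # 0
    "broken lightsaber",   # 1
    "laser light show",    # 2
    "warp drive",          # 3
    "laser cannon",        # 4
    "lightsaber light",    # 5
    "time travel",         # 6
    "double lightsaber",   # 7
]

index = build_inverted_index(docs)
for token in ("laser", "light", "lightsaber"):
    print(token, "=>", index[token])
```

Answering "which documents mention lightsaber?" is now a single dictionary lookup instead of a pass over every document.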
14. Computers == Dumb
Humans are smart
o I see "cat" or "cats" in the back of a book: no duh, jump to page 9
Computers are dumb
o CAT != cat: no match returned
o cat != cats: no match returned
Hence, when indexing, normalize text to a more searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
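A toy normalizer, sketched in Python, that reproduces the three examples above. The two suffix rules are a deliberate simplification invented for this illustration, not the stemming algorithm Lucene ships; note how blindly stripping a trailing "s" is what produces "alumnu".

```python
VOWELS = set("aeiou")

def normalize(word):
    """Lowercase a word and strip a couple of common English suffixes."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo consonant doubling: "fitt" -> "fit".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

for raw in ("CAT", "cats", "fitted", "alumnus"):
    print(raw, "->", normalize(raw))
```

The normalized form does not have to be a real word. It only has to be the same string at index time and at query time.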
15. Normalization aka Text Analysis
Raw input: <p>Darth Vader dined with Luke</p>
Filtered (char filter): Darth Vader dined with Luke
Tokenized:
o Darth Vader dined with Luke
o [Darth] [Vader] [dined] [with] [Luke]
Token filters (lowercased, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]
Most importantly: this is highly configurable
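The three stages can be sketched as plain Python functions. This is an illustration of the idea only: the stop-word set and the NORMAL_FORMS lookup table are assumptions standing in for real stop-word, synonym, and stemming filters, and none of the names below are Elasticsearch APIs.

```python
import re

STOP_WORDS = {"with", "the", "a", "and"}             # assumed "pointless words"
NORMAL_FORMS = {"dined": "dine", "dinner": "dine"}   # stand-in for stemming/synonyms

def char_filter(raw):
    """Strip HTML tags from the raw input."""
    return re.sub(r"<[^>]+>", "", raw)

def tokenize(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Lowercase, drop stop words, and map words to a canonical form."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(NORMAL_FORMS.get(token, token))
    return result

def analyze(raw):
    """Run the full chain: char filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(raw)))

print(analyze("<p>Darth Vader dined with Luke</p>"))
```

Each stage is a separate, swappable step, which is what makes the chain configurable.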
17. What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'
18. Ranking
Can we store anything in each <metadata> slot to help decide how relevant this term is for this doc?
Yes!
- Term frequency
  - How much "darth" is in this doc?
- Position within document
  - Helps when we search for the phrase "darth vader"
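A small Python sketch of what that per-document metadata could look like: each posting keeps the list of positions where the term occurs, so term frequency is the length of that list and a two-word phrase matches when the positions are adjacent. The function names and the already-analyzed token lists are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return index

def term_frequency(index, term, doc_id):
    """How many times does the term occur in the doc?"""
    return len(index.get(term, {}).get(doc_id, []))

def phrase_match(index, first, second):
    """Ids of docs where `second` appears immediately after `first`."""
    hits = []
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc_id, []))
        if any(pos + 1 in second_positions for pos in first_positions):
            hits.append(doc_id)
    return sorted(hits)

# Already-analyzed tokens for the two example docs (assumed analysis output).
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}
index = build_positional_index(docs)

print(term_frequency(index, "darth", 1))
print(phrase_match(index, "darth", "vader"))
```

Doc 2 contains "darth" but is not a phrase match, because no "vader" follows it.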
19. Query Documents
When did Darth Vader and Luke have dinner?
User query: luke darth dinner
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'
20. What happens when we query?
User query: luke darth dinner
Analysis: [luke] [darth] [dine]
How to consult the index for matches?
o [darth]: score for [darth] docs (1 and 2)
o [dine]: score for [dine] docs (1)
Return sorted docs to the client
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
  ...
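The whole query path can be sketched end to end in Python. Everything here is a simplification under stated assumptions: analyze is a hypothetical stand-in for the configured analysis chain, and the score is just the summed term frequency, whereas Lucene's actual similarity functions weigh more than raw counts.

```python
from collections import Counter, defaultdict

NORMAL_FORMS = {"dined": "dine", "dinner": "dine"}  # stand-in for stemming/synonyms
STOP_WORDS = {"with", "we"}                          # assumed stop words

def analyze(text):
    """Lowercase, map words to canonical forms, drop stop words."""
    tokens = [NORMAL_FORMS.get(token, token) for token in text.lower().split()]
    return [token for token in tokens if token not in STOP_WORDS]

def build_index(docs):
    """Map term -> Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Analyze the query, score each matching doc, return best first."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, frequency in index.get(term, {}).items():
            scores[doc_id] += frequency
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {1: "Darth Vader dined with Luke", 2: "We love Darth"}
index = build_index(docs)
results = search(index, "luke darth dinner")
print(results)
```

Doc 1 matches all three query terms and doc 2 only matches [darth], so doc 1 is returned first. The query text goes through the same analysis as the documents, which is why "dinner" finds "dined".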
21. So Elasticsearch!
FAST!
o The inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation
FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not shown here)
FRUITFUL!
o Relevant documents are scored based on relative term frequency
22. BUT WAIT, THERE'S MORE
Many non-traditional applications of search
o Rank file directory by proximity to current directory
o Geographic-aided search: rank based on distance and search relevancy
o Q & A systems: Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!
And many features!
o Spellchecking
o Facets
o More-like-this document