Presentation given on NYC software engineering group [http://www.meetup.nycsoftware.org/events/102941112/] on Feb 07 2013.
An overview of the implementation of ElasticSearch as the new search and browse engine at emusic.com.
This talk shows the challenges that the team faced while putting this amazing solution to work.
By replacing a proprietary legacy Oracle Endeca product to ElasticSearch emusic was able to reduce by 400% the number of nodes used by the search engine, response times were down by 200% and with an 80% increase of traffic load.
We serve over 5 million requests a day on our search servers.
2. About
emusic.com is one of the leading digital music
retailers, is committed to serving music
enthusiasts with aggressive development of
tools and features solely designed to
enhance personalized music experience
Searching is a BIG part of emusic.com
discovery experience
3. Search challenges
Results MUST be relevant
How do we de?ne relevancy?
Low response times ( < 100ms)
High availability (Users don¡¯t tolerate not
?nding what they are looking for)
Static search results are old school, your
engine should ¡°know¡± your users preferences
How to recover from catastrophic crashes
4. Goals
Improve results relevancy : The Adele 21
problem
Get rid of proprietary software
Have an extensible search engine
Understand what¡¯s happening under the hood
Integrate with other user projects :
recommendation, af?nity, activity
5. SEABRO Project
Replace current proprietary Endeca search
engine to a modern search engine
Get more relevant results
Flexible API
Facet searching (for browsing)
Ability to scale out
3 months in execution (1st phase)
6. Why ElasticSearch?
Built from ground up to scale out
Powerful DSL
Schema-less -> JSON
Distributed Lucene
Very powerful API, allows automation of
the whole cluster using simple curl
commands
It¡¯s bonsai cool
13. Cluster with 3 nodes : adding replicas
Yellow state, not all
replicas set
14. Cluster with 3 nodes : adding replicas
Index aliasing
Primary Shard
15. Cluster with 3 nodes : node crash
Without replication, our
cluster now is missing
one shard.
16. Client Node types
Three types
Data node : joins the cluster, fetch shard
data
Client node : joins the cluster, participates
on sorting operations, don¡¯t fetch data
Transport node : Don¡¯t join the cluster,
only send requests towards it
17. Extensible architecture : plugins
Site : GUI interfaces
River : integration hook to fetch data from
other sources (DB, MQ, FS ... )
Transport : Allow different transports to be
plugged into ES core
Scripting : Allow adding new languages to
the scripting system
Analysis : Custom analyzers for indexing/
searching
Misc : You know, everything else
19. Our numbers
25+ million documents
Multi types: Songs, Albums, Artists, Audio
Books, Composers, Labels, Authors,
Conductors, Composers ...
5+ million search requests per day
~ 100 gb index. And it only takes 1 hour to
build it from ground up (thanks to Akka
engine)
22. Lesson #1
Get professional Help
elasticsearch.org is very
well documented
But when it comes to
prod, ask the experts
Get professional support
from elasticsearch.com
23. Lesson #2
Understand shards
Sharding is what make
distributable search
possible
Understanding what they
mean and how can they
speed up your engine is a
must
25. Lesson #2
Understand shards
Increasing the number of shards will boost
your query times
Each shard maps to a Lucene Index Reader/
Writer, the more power your box has, the
more shards you should have
Replication will boost cluster response time
26. Lesson # 3
Design your data ?ow ahead of your
schema
The way you model your schemas have a
deep impact on how fast your engine can
become
Don¡¯t be afraid to replicate information using
a different structure
27. Lesson #4
Master Query DSL
Just like you know SQL, you should
understand the query DSL pretty very well
Indexed Data won¡¯t ?nd itself
Understand that sometimes you must change
data representation in order to get things to
be found
28. Lesson #5
Learn at least a bit about lucene
internals
Understanding how lucene¡¯s scoring works
helps designing better queries
Elasticsearch supports custom score using
scripts
Could hurt you on performance :(
29. Lesson #6
Put slow queries to work. Use
explain
Explain gives useful information on how
documents are being scored
Slow query log will show you which queries
are actually hurting you
Sometimes its just document cache misses
30. Lesson #7
Take GC by the horns
ES nodes can demand a lot of memory
JDK still thinks its 2003 when it comes to
memory size
Memory fragmentation
Full GC times can bring your cluster to its
knees
31. Lesson #7
Take GC by the horns
Maximum 30 GB per node
Beefy machines = more nodes per machine
Changed full GC threshold to start when
memory reaches 60% -> Giving JVM plenty
of room until memory is claimed
32. Lesson #8
Caching can eat up your memory
Caching is a necessary evil but:
Field cache stores sorted and faceted data
Filter cache stores ?ltered data
Cache eviction must be controlled
33. Lesson #8
Caching can eat up your memory
Your queries and how you facet will have a
huge impact on cache size
Bigger your shard is, more memory you will
need for caching
Facet caching for multi valued ?elds in 0.20
is not optimal, take that in consideration
34. Lesson #9
Monitor your cluster
Keep an eye on your cluster
It¡¯s vital to monitor both system metrics
(CPU, memory, ?le system) but also correlate
that with query information
ES provides nice plugins like bigdesk and
paramedic. But history is vital so get
something like sematext SPM
35. Lesson #10
Distributed systems are hard
Needless to
say, but don¡¯t
expect all
that power
to come for
free
36. Lesson #11
Have an A/B testing suite ready
De?ning relevancy is hard
People have different views on relevance
Hard to explain to a user why Joe Doe does
not show up on its query results
37. Lesson # 11
Have an A/B testing suite ready
Start with a baseline search that returns ¡°relevant
enough¡± results
Give points for every record found, the higher it is
the more points it get
Sum it all, and you have your score
When updating your queries, run the suite and
check if you get better results
38. Lesson # 12
Track user interaction
Monitor how many ¡°clicks¡± your users are
executing once you changed queries
Again, your de?nition of relevant may not be
what your users expect
Adapt
39. Final words
In the end, ES proved to be a very reliable
and affordable solution
Not only we increased the quality of results
but we have also reduced the query
response times
Request time dropped over 200%. Cluster
size reduced by 400% and with a 80%
increase in load
YES We did save money and increased
quality at the same time