ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
ElasticSearch from the
        trenches



     Vinicius Carvalho / @vccarvalho
            blog.furiousbob.com
About

emusic.com is one of the leading digital music
retailers, is committed to serving music
enthusiasts with aggressive development of
tools and features solely designed to
enhance personalized music experience

Searching is a BIG part of emusic.com
discovery experience
Search challenges

Results MUST be relevant

  How do we de?ne relevancy?

Low response times ( < 100ms)

High availability (Users don¡¯t tolerate not
?nding what they are looking for)

Static search results are old school, your
engine should ¡°know¡± your users preferences

How to recover from catastrophic crashes
Goals

Improve results relevancy : The Adele 21
problem

Get rid of proprietary software

Have an extensible search engine

Understand what¡¯s happening under the hood

Integrate with other user projects :
recommendation, af?nity, activity
SEABRO Project
Replace current proprietary Endeca search
engine to a modern search engine

Get more relevant results

Flexible API

Facet searching (for browsing)

Ability to scale out

3 months in execution (1st phase)
Why ElasticSearch?


Built from ground up to scale out

Powerful DSL

Schema-less -> JSON

Distributed Lucene

Very powerful API, allows automation of
the whole cluster using simple curl
commands

It¡¯s bonsai cool
SOLR: The contender



Powered by Lucene

Just too much XML ?les to get anything
done

Uses XML

No Query DSL
Elasticsearch at a glance


client
          q1
                    node
         r1
client


                    node
         q1
client


         r1
Cluster with 1 node
                            Cluster state




                         Node name




  Cool Names

                Shards
Cluster with 2 nodes : Rebalancing




              Shard being
              relocated
Cluster with 2 nodes : stable




      Recently relocated shard
Cluster with 3 nodes : stable
Cluster with 3 nodes : adding replicas
                          Yellow state, not all
                          replicas set
Cluster with 3 nodes : adding replicas



                              Index aliasing




            Primary Shard
Cluster with 3 nodes : node crash
                             Without replication, our
                             cluster now is missing
                             one shard.
Client Node types


Three types

  Data node : joins the cluster, fetch shard
  data

  Client node : joins the cluster, participates
  on sorting operations, don¡¯t fetch data

  Transport node : Don¡¯t join the cluster,
  only send requests towards it
Extensible architecture : plugins
Site : GUI interfaces

River : integration hook to fetch data from
other sources (DB, MQ, FS ... )

Transport : Allow different transports to be
plugged into ES core

Scripting : Allow adding new languages to
the scripting system

Analysis : Custom analyzers for indexing/
searching

Misc : You know, everything else
Site plugins
Our numbers

25+ million documents

Multi types: Songs, Albums, Artists, Audio
Books, Composers, Labels, Authors,
Conductors, Composers ...

5+ million search requests per day

~ 100 gb index. And it only takes 1 hour to
build it from ground up (thanks to Akka
engine)
Indexing ?ow


                    ES cluster

            a
            k
Oracle
            k
            a
           actors
Lessons learned
Lesson #1
      Get professional Help


elasticsearch.org is very
well documented

But when it comes to
prod, ask the experts

Get professional support
from elasticsearch.com
Lesson #2
            Understand shards


Sharding is what make
distributable search
possible

Understanding what they
mean and how can they
speed up your engine is a
must
Lesson #2
Understand shards
Lesson #2
        Understand shards


Increasing the number of shards will boost
your query times

Each shard maps to a Lucene Index Reader/
Writer, the more power your box has, the
more shards you should have

Replication will boost cluster response time
Lesson # 3
Design your data ?ow ahead of your
              schema


  The way you model your schemas have a
  deep impact on how fast your engine can
  become

  Don¡¯t be afraid to replicate information using
  a different structure
Lesson #4
        Master Query DSL


Just like you know SQL, you should
understand the query DSL pretty very well

Indexed Data won¡¯t ?nd itself

Understand that sometimes you must change
data representation in order to get things to
be found
Lesson #5
Learn at least a bit about lucene
            internals



 Understanding how lucene¡¯s scoring works
 helps designing better queries

 Elasticsearch supports custom score using
 scripts

   Could hurt you on performance :(
Lesson #6
Put slow queries to work. Use
           explain


Explain gives useful information on how
documents are being scored

Slow query log will show you which queries
are actually hurting you

  Sometimes its just document cache misses
Lesson #7
      Take GC by the horns


ES nodes can demand a lot of memory

JDK still thinks its 2003 when it comes to
memory size

Memory fragmentation

Full GC times can bring your cluster to its
knees
Lesson #7
     Take GC by the horns


Maximum 30 GB per node

Beefy machines = more nodes per machine

Changed full GC threshold to start when
memory reaches 60% -> Giving JVM plenty
of room until memory is claimed
Lesson #8
   Caching can eat up your memory




Caching is a necessary evil but:

  Field cache stores sorted and faceted data

  Filter cache stores ?ltered data

Cache eviction must be controlled
Lesson #8
Caching can eat up your memory



Your queries and how you facet will have a
huge impact on cache size

Bigger your shard is, more memory you will
need for caching

Facet caching for multi valued ?elds in 0.20
is not optimal, take that in consideration
Lesson #9
       Monitor your cluster

Keep an eye on your cluster

It¡¯s vital to monitor both system metrics
(CPU, memory, ?le system) but also correlate
that with query information

ES provides nice plugins like bigdesk and
paramedic. But history is vital so get
something like sematext SPM
Lesson #10
 Distributed systems are hard



Needless to
say, but don¡¯t
expect all
that power
to come for
free
Lesson #11
Have an A/B testing suite ready



De?ning relevancy is hard

People have different views on relevance

Hard to explain to a user why Joe Doe does
not show up on its query results
Lesson # 11
Have an A/B testing suite ready

Start with a baseline search that returns ¡°relevant
enough¡± results

Give points for every record found, the higher it is
the more points it get

Sum it all, and you have your score

When updating your queries, run the suite and
check if you get better results
Lesson # 12
        Track user interaction


Monitor how many ¡°clicks¡± your users are
executing once you changed queries

Again, your de?nition of relevant may not be
what your users expect

Adapt
Final words
In the end, ES proved to be a very reliable
and affordable solution

Not only we increased the quality of results
but we have also reduced the query
response times

Request time dropped over 200%. Cluster
size reduced by 400% and with a 80%
increase in load

YES We did save money and increased
quality at the same time
Next steps
Classify data
Classify data during indexing time instead of
using custom scripts
Search +
Recommendation
Click analysis

More Related Content

Elastic search from the trenches

  • 1. ElasticSearch from the trenches Vinicius Carvalho / @vccarvalho blog.furiousbob.com
  • 2. About emusic.com is one of the leading digital music retailers, is committed to serving music enthusiasts with aggressive development of tools and features solely designed to enhance personalized music experience Searching is a BIG part of emusic.com discovery experience
  • 3. Search challenges Results MUST be relevant How do we de?ne relevancy? Low response times ( < 100ms) High availability (Users don¡¯t tolerate not ?nding what they are looking for) Static search results are old school, your engine should ¡°know¡± your users preferences How to recover from catastrophic crashes
  • 4. Goals Improve results relevancy : The Adele 21 problem Get rid of proprietary software Have an extensible search engine Understand what¡¯s happening under the hood Integrate with other user projects : recommendation, af?nity, activity
  • 5. SEABRO Project Replace current proprietary Endeca search engine to a modern search engine Get more relevant results Flexible API Facet searching (for browsing) Ability to scale out 3 months in execution (1st phase)
  • 6. Why ElasticSearch? Built from ground up to scale out Powerful DSL Schema-less -> JSON Distributed Lucene Very powerful API, allows automation of the whole cluster using simple curl commands It¡¯s bonsai cool
  • 7. SOLR: The contender Powered by Lucene Just too much XML ?les to get anything done Uses XML No Query DSL
  • 8. Elasticsearch at a glance client q1 node r1 client node q1 client r1
  • 9. Cluster with 1 node Cluster state Node name Cool Names Shards
  • 10. Cluster with 2 nodes : Rebalancing Shard being relocated
  • 11. Cluster with 2 nodes : stable Recently relocated shard
  • 12. Cluster with 3 nodes : stable
  • 13. Cluster with 3 nodes : adding replicas Yellow state, not all replicas set
  • 14. Cluster with 3 nodes : adding replicas Index aliasing Primary Shard
  • 15. Cluster with 3 nodes : node crash Without replication, our cluster now is missing one shard.
  • 16. Client Node types Three types Data node : joins the cluster, fetch shard data Client node : joins the cluster, participates on sorting operations, don¡¯t fetch data Transport node : Don¡¯t join the cluster, only send requests towards it
  • 17. Extensible architecture : plugins Site : GUI interfaces River : integration hook to fetch data from other sources (DB, MQ, FS ... ) Transport : Allow different transports to be plugged into ES core Scripting : Allow adding new languages to the scripting system Analysis : Custom analyzers for indexing/ searching Misc : You know, everything else
  • 19. Our numbers 25+ million documents Multi types: Songs, Albums, Artists, Audio Books, Composers, Labels, Authors, Conductors, Composers ... 5+ million search requests per day ~ 100 gb index. And it only takes 1 hour to build it from ground up (thanks to Akka engine)
  • 20. Indexing ?ow ES cluster a k Oracle k a actors
  • 22. Lesson #1 Get professional Help elasticsearch.org is very well documented But when it comes to prod, ask the experts Get professional support from elasticsearch.com
  • 23. Lesson #2 Understand shards Sharding is what make distributable search possible Understanding what they mean and how can they speed up your engine is a must
  • 25. Lesson #2 Understand shards Increasing the number of shards will boost your query times Each shard maps to a Lucene Index Reader/ Writer, the more power your box has, the more shards you should have Replication will boost cluster response time
  • 26. Lesson # 3 Design your data ?ow ahead of your schema The way you model your schemas have a deep impact on how fast your engine can become Don¡¯t be afraid to replicate information using a different structure
  • 27. Lesson #4 Master Query DSL Just like you know SQL, you should understand the query DSL pretty very well Indexed Data won¡¯t ?nd itself Understand that sometimes you must change data representation in order to get things to be found
  • 28. Lesson #5 Learn at least a bit about lucene internals Understanding how lucene¡¯s scoring works helps designing better queries Elasticsearch supports custom score using scripts Could hurt you on performance :(
  • 29. Lesson #6 Put slow queries to work. Use explain Explain gives useful information on how documents are being scored Slow query log will show you which queries are actually hurting you Sometimes its just document cache misses
  • 30. Lesson #7 Take GC by the horns ES nodes can demand a lot of memory JDK still thinks its 2003 when it comes to memory size Memory fragmentation Full GC times can bring your cluster to its knees
  • 31. Lesson #7 Take GC by the horns Maximum 30 GB per node Beefy machines = more nodes per machine Changed full GC threshold to start when memory reaches 60% -> Giving JVM plenty of room until memory is claimed
  • 32. Lesson #8 Caching can eat up your memory Caching is a necessary evil but: Field cache stores sorted and faceted data Filter cache stores ?ltered data Cache eviction must be controlled
  • 33. Lesson #8 Caching can eat up your memory Your queries and how you facet will have a huge impact on cache size Bigger your shard is, more memory you will need for caching Facet caching for multi valued ?elds in 0.20 is not optimal, take that in consideration
  • 34. Lesson #9 Monitor your cluster Keep an eye on your cluster It¡¯s vital to monitor both system metrics (CPU, memory, ?le system) but also correlate that with query information ES provides nice plugins like bigdesk and paramedic. But history is vital so get something like sematext SPM
  • 35. Lesson #10 Distributed systems are hard Needless to say, but don¡¯t expect all that power to come for free
  • 36. Lesson #11 Have an A/B testing suite ready De?ning relevancy is hard People have different views on relevance Hard to explain to a user why Joe Doe does not show up on its query results
  • 37. Lesson # 11 Have an A/B testing suite ready Start with a baseline search that returns ¡°relevant enough¡± results Give points for every record found, the higher it is the more points it get Sum it all, and you have your score When updating your queries, run the suite and check if you get better results
  • 38. Lesson # 12 Track user interaction Monitor how many ¡°clicks¡± your users are executing once you changed queries Again, your de?nition of relevant may not be what your users expect Adapt
  • 39. Final words In the end, ES proved to be a very reliable and affordable solution Not only we increased the quality of results but we have also reduced the query response times Request time dropped over 200%. Cluster size reduced by 400% and with a 80% increase in load YES We did save money and increased quality at the same time
  • 41. Classify data Classify data during indexing time instead of using custom scripts