
ElasticSearch from the
trenches

Vinicius Carvalho / @vccarvalho
blog.furiousbob.com

About

emusic.com, one of the leading digital music
retailers, is committed to serving music
enthusiasts with aggressive development of
tools and features designed solely to
enhance the personalized music experience

Search is a BIG part of the emusic.com
discovery experience

Search challenges

Results MUST be relevant

How do we define relevancy?

Low response times ( < 100ms)

High availability (Users don't tolerate not
finding what they are looking for)

Static search results are old school, your
engine should "know" your users' preferences

How to recover from catastrophic crashes

Goals

Improve results relevancy : The Adele 21
problem

Get rid of proprietary software

Have an extensible search engine

Understand what's happening under the hood

Integrate with other user projects :
recommendation, affinity, activity

SEABRO Project
Replace the current proprietary Endeca search
engine with a modern search engine

Get more relevant results

Flexible API

Facet searching (for browsing)

Ability to scale out

3 months in execution (1st phase)

Why ElasticSearch?

Built from ground up to scale out

Powerful DSL

Schema-less -> JSON

Distributed Lucene

Very powerful API, allows automation of
the whole cluster using simple curl
commands
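
As a rough illustration of that kind of automation, the sketch below builds the same HTTP requests one would send with curl, using only the Python standard library. The host localhost:9200 and the index name "catalog" are assumptions made for the example, not details from the talk; _cluster/health and the per-index _settings endpoint are standard Elasticsearch REST endpoints.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def cluster_health_request(base_url=BASE_URL):
    """Build a GET request for the cluster health endpoint."""
    return urllib.request.Request(f"{base_url}/_cluster/health", method="GET")


def set_replicas_request(index, replicas, base_url=BASE_URL):
    """Build a PUT request that changes the replica count of an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=5):
    """Send a prepared request and decode the JSON reply (needs a running cluster)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "catalog" is a hypothetical index name; nothing is sent here, only printed.
    req = set_replicas_request("catalog", 1)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch easy to inspect without a live cluster; send() is the only part that touches the network.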

It's bonsai cool

SOLR: The contender

Powered by Lucene

Just too many XML files to get anything
done

Uses XML

No Query DSL

Elasticsearch at a glance

[Diagram: clients, nodes, query q1, response r1]

Cluster with 1 node
Cluster state

Node name

Cool Names

Shards

Cluster with 2 nodes : Rebalancing

Shard being
relocated

Cluster with 2 nodes : stable

Recently relocated shard

Cluster with 3 nodes : adding replicas
Yellow state, not all
replicas set

Cluster with 3 nodes : adding replicas

Index aliasing

Primary Shard

Cluster with 3 nodes : node crash
Without replication, our
cluster now is missing
one shard.

Client Node types

Three types

Data node : joins the cluster, fetches shard
data

Client node : joins the cluster, participates
in sorting operations, does not fetch data

Transport node : Does not join the cluster,
only sends requests to it

Extensible architecture : plugins
Site : GUI interfaces

River : integration hook to fetch data from
other sources (DB, MQ, FS ... )

Transport : Allow different transports to be
plugged into ES core

Scripting : Allow adding new languages to
the scripting system

Analysis : Custom analyzers for indexing/
searching

Misc : You know, everything else

Our numbers

25+ million documents

Multiple types: Songs, Albums, Artists, Audio
Books, Composers, Labels, Authors,
Conductors ...

5+ million search requests per day

~100 GB index. And it only takes 1 hour to
build it from the ground up (thanks to the Akka
engine)

Indexing flow

[Diagram: Oracle, Akka actors, ES cluster]

Lesson #1
Get professional Help

elasticsearch.org is very
well documented

But when it comes to
prod, ask the experts

Get professional support
from elasticsearch.com

Lesson #2
Understand shards

Sharding is what makes
distributed search
possible

Understanding what shards
are and how they can
speed up your engine is a
must

Lesson #2
Understand shards

Increasing the number of shards can improve
your query response times

Each shard maps to a Lucene Index Reader/
Writer. The more power your box has, the
more shards you should have

Replication will boost cluster response time
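
To make the shard arithmetic concrete, here is a small sketch that builds an index settings body and counts how many shard copies the cluster has to host. number_of_shards and number_of_replicas are the real index settings; the values 5, 1 and 3 are illustrative assumptions, not the numbers used at emusic.

```python
def index_settings(primary_shards, replicas):
    """Settings body for creating an index with the given shard layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primary_shards, replicas):
    """Each primary shard has `replicas` extra copies, so copies = primaries * (1 + replicas)."""
    return primary_shards * (1 + replicas)


def copies_per_node(primary_shards, replicas, data_nodes):
    """Average number of shard copies per data node, assuming an even spread."""
    if data_nodes <= 0:
        raise ValueError("data_nodes must be positive")
    return total_shard_copies(primary_shards, replicas) / data_nodes


if __name__ == "__main__":
    # Illustrative values only: 5 primaries, 1 replica each, 3 data nodes.
    print(index_settings(5, 1))
    print(total_shard_copies(5, 1), "shard copies in total")
    print(round(copies_per_node(5, 1, 3), 2), "copies per node on average")
```

Each copy is a full Lucene index with its own reader and writer, which is why the per-node count matters when sizing boxes.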

Lesson #3
Design your data flow ahead of your
schema

The way you model your schemas has a
deep impact on how fast your engine can
become

Don't be afraid to replicate information using
a different structure

Lesson #4
Master Query DSL

Just like you know SQL, you should
understand the query DSL very well

Indexed data won't find itself

Understand that sometimes you must change
data representation in order to get things to
be found
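
For readers new to the query DSL, a minimal sketch of a search body follows. It combines a required match clause with an optional one inside a bool query. The field names title and artist are hypothetical, and this is not the query emusic used; it only shows the shape of the JSON.

```python
import json


def album_search(text, artist=None, size=10):
    """Build a search body: `text` must match the title, `artist` only boosts the score."""
    bool_query = {"must": [{"match": {"title": text}}]}  # hypothetical field name
    if artist:
        # A should clause is optional for matching but raises the score when it hits.
        bool_query["should"] = [{"match": {"artist": artist}}]  # hypothetical field name
    return {"query": {"bool": bool_query}, "size": size}


if __name__ == "__main__":
    print(json.dumps(album_search("21", artist="Adele"), indent=2))
```

The must/should split is the part worth internalizing: must decides what is found, should only influences ranking.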

Lesson #5
Learn at least a bit about Lucene
internals

Understanding how Lucene's scoring works
helps you design better queries

Elasticsearch supports custom scoring using
scripts

But scripts can hurt performance :(

Lesson #6
Put slow queries to work. Use
explain

Explain gives useful information on how
documents are being scored

Slow query log will show you which queries
are actually hurting you
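
A short sketch of both tools: the explain flag is set in the search body, and the slow query log is driven by per-index threshold settings. The setting names are the standard index.search.slowlog ones; the threshold values and the sample query are assumptions chosen for illustration.

```python
def with_explain(search_body):
    """Return a copy of a search body that asks for a score explanation per hit."""
    body = dict(search_body)
    body["explain"] = True
    return body


def slowlog_settings(warn="2s", info="500ms"):
    """Index settings that log queries slower than the given thresholds.

    The threshold values are illustrative assumptions; tune them to your own latency goals.
    """
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
    }


if __name__ == "__main__":
    query = {"query": {"match": {"title": "21"}}}  # hypothetical field name
    print(with_explain(query))
    print(slowlog_settings())
```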

Sometimes it's just document cache misses

Lesson #7
Take GC by the horns

ES nodes can demand a lot of memory

The JDK still thinks it's 2003 when it comes to
memory size

Memory fragmentation

Full GC times can bring your cluster to its
knees

Lesson #7
Take GC by the horns

Maximum 30 GB per node

Beefy machines = more nodes per machine

We changed the full GC threshold to start when
memory reaches 60%, giving the JVM plenty
of room before memory is reclaimed
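
The slides do not say which collector or flag was changed. One plausible reading, assuming the CMS collector that was commonly paired with Elasticsearch on older HotSpot JVMs, is the occupancy threshold shown in this sketch. The 30 GB heap and the 60% figure come from the slide; the choice of flags is an assumption.

```python
def jvm_gc_flags(heap_gb=30, occupancy_percent=60):
    """Build HotSpot flags for a fixed-size heap with an early CMS start.

    Assumption: the "full GC threshold" in the slide maps to
    CMSInitiatingOccupancyFraction. These flags only apply to older JVMs
    that still ship the CMS collector.
    """
    if not 0 < occupancy_percent < 100:
        raise ValueError("occupancy_percent must be between 1 and 99")
    return [
        f"-Xms{heap_gb}g",  # same min and max heap, so the heap is never resized
        f"-Xmx{heap_gb}g",
        "-XX:+UseConcMarkSweepGC",
        f"-XX:CMSInitiatingOccupancyFraction={occupancy_percent}",
        "-XX:+UseCMSInitiatingOccupancyOnly",  # always honor the fraction above
    ]


if __name__ == "__main__":
    print(" ".join(jvm_gc_flags()))
```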

Lesson #8
Caching can eat up your memory

Caching is a necessary evil but:

Field cache stores sorted and faceted data

Filter cache stores filtered data

Cache eviction must be controlled

Lesson #8
Caching can eat up your memory

Your queries and how you facet will have a
huge impact on cache size

The bigger your shard is, the more memory you will
need for caching

Facet caching for multi-valued fields in 0.20
is not optimal, take that into consideration

Lesson #9
Monitor your cluster

Keep an eye on your cluster

It's vital to monitor system metrics
(CPU, memory, file system) and to correlate
them with query information

ES provides nice plugins like bigdesk and
paramedic. But history is vital so get
something like Sematext SPM

Lesson #10
Distributed systems are hard

Needless to say, but don't expect all
that power to come for free

Lesson #11
Have an A/B testing suite ready

Defining relevancy is hard

People have different views on relevance

Hard to explain to a user why Joe Doe does
not show up in their query results

Lesson #11
Have an A/B testing suite ready

Start with a baseline search that returns "relevant
enough" results

Give points for every record found: the higher it ranks,
the more points it gets

Sum it all, and you have your score

When updating your queries, run the suite and
check if you get better results
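
The scoring scheme above can be sketched in a few lines. The point scale (10 points for rank 1, down to 1 point for rank 10) and the function names are assumptions for illustration; the talk only says that higher-ranked hits earn more points and that the points are summed.

```python
def relevance_score(expected_ids, result_ids, max_rank=10):
    """Sum points for expected documents found in the top `max_rank` results.

    A hit at rank 1 earns `max_rank` points, a hit at rank `max_rank` earns 1.
    """
    expected = set(expected_ids)
    seen = set()
    score = 0
    for rank, doc_id in enumerate(result_ids[:max_rank], start=1):
        if doc_id in expected and doc_id not in seen:
            seen.add(doc_id)  # count each expected document only once
            score += max_rank - rank + 1
    return score


def suite_score(cases, search_fn, max_rank=10):
    """Run every (query, expected_ids) case through `search_fn` and sum the scores."""
    return sum(
        relevance_score(expected, search_fn(query), max_rank)
        for query, expected in cases
    )


if __name__ == "__main__":
    # Hypothetical baseline: canned results standing in for a real search call.
    canned = {"adele 21": ["album-21", "song-x", "artist-adele"]}
    cases = [("adele 21", ["album-21", "artist-adele"])]
    print(suite_score(cases, lambda q: canned.get(q, [])))
```

Running the same cases against the old and the new query and comparing the two totals gives a quick signal of whether a change helped.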

Lesson #12
Track user interaction

Monitor how many "clicks" your users
make after you change queries

Again, your definition of relevant may not be
what your users expect

Adapt

Final words
In the end, ES proved to be a very reliable
and affordable solution

Not only did we increase the quality of results,
but we also reduced the query
response times

Request time dropped over 200%. Cluster
size was reduced by 400%, and with an 80%
increase in load

YES, we did save money and increase
quality at the same time

Classify data
Classify data during indexing time instead of
using custom scripts
