2. Me
Python developer
DevOps role
using and learning Elasticsearch for 3 years (since version 0.16)
Synopsi.tv, Reactor
3. Synopsi.tv
movie recommendations service
database of movies, TV shows
need for search - Elasticsearch
search-box on every page (prefix search)
advanced search (search + facets)
PostgreSQL as main datastore
import to ES with script, hooks on add/update
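A rough sketch of the "hooks on add/update" idea, assuming the elasticsearch-py client and a hypothetical Movie object; the index, type and field names are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    def on_movie_saved(movie):
        # hypothetical hook called by the application right after the PostgreSQL commit;
        # it sends the same document to Elasticsearch so search stays (almost) in sync
        doc = {"title": movie.title, "year": movie.year, "cast": movie.cast}
        es.index(index="movies", doc_type="movie", id=movie.id, body=doc)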
4. Synopsi.tv - lessons
Good for search
Mappings are powerful
need for reindexing - format/mapping change
sometimes missing documents (Bruce Willis) - in the index but not searchable
probably not yet suitable as the only datastore
5. Reactor
service for communication with users of your application
send data about users and their activity (events)
filter users, define segments
set rules for reactions - email, webhook, SMS, etc.
6. Data structure
small and simple pieces of data - JSON
simple relations (application <-> user <-> field)
NoSQL is more suitable, but doable in e.g. PostgreSQL
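For illustration, a hypothetical event in roughly this shape (all field names invented):

    # a small, simple piece of data: one event tying an application to a user,
    # with arbitrary extra fields
    event = {
        "application_id": "app-123",
        "user_id": "user-456",
        "type": "purchase",
        "timestamp": "2014-05-01T12:00:00Z",
        "fields": {"plan": "premium", "amount": 29.0},
    }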
7. Backends - theory
application saves data in several forms:
raw (as they come)
cleaned and sanitized
formatted for specific datastore
save method iterates over the configured set of datastore backends and sends the same data to all of them (sketch below)
different backends for different operations - get, filter, analytics
slight duplication of functionality in application
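A minimal sketch of that save loop; the Backend interface and class names are assumptions, not the actual Reactor code:

    class Backend(object):
        # one configured datastore (e.g. PostgreSQL, Elasticsearch, Mongo/DynamoDB)
        def format(self, data):
            return data  # datastore-specific formatting, overridden per backend
        def save(self, data):
            raise NotImplementedError

    class Storage(object):
        def __init__(self, backends):
            self.backends = backends  # the configured set, differs per environment
        def sanitize(self, raw):
            return {k: v for k, v in raw.items() if v is not None}  # cleaned form
        def save(self, raw):
            cleaned = self.sanitize(raw)
            for backend in self.backends:  # same data goes to every backend
                backend.save(backend.format(cleaned))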
8. Backends - practice
Mongo/DynamoDB - raw data, cold storage
PostgreSQL - working data
Elasticsearch - added later, working data, analytics
different sets in different environments (devel machine, production)
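One way such per-environment sets could be configured (a settings-style dict; the names are hypothetical):

    BACKENDS = {
        "devel": ["postgresql"],  # keep the development machine light
        "production": ["dynamodb", "postgresql", "elasticsearch"],  # cold storage + working data + analytics
    }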
9. Source of truth
one backend is trusted by definition - the cold storage, used to repopulate data in the other backends (repopulation sketch below)
should be simple (hard to break) and scalable (probably in the cloud)
possible forms:
JSON files, Hadoop, etc.
NoSQL database (Mongo, Cassandra, DynamoDB)
Elasticsearch - different format (indices, nodes) from the working data
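A rough sketch of repopulating working backends from the source of truth, assuming the cold storage is a file of JSON lines and the backends follow the save() interface sketched earlier; everything here is illustrative:

    import json

    def repopulate(path, backends):
        # replay every raw record from cold storage into the given backends
        with open(path) as cold_storage:
            for line in cold_storage:
                record = json.loads(line)
                for backend in backends:
                    backend.save(record)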
10. Input to Elasticsearch
regularly run import/update script - if you do not need (almost-)live data (sketch below)
logic in application (our case)
river - an input channel from another source to Elasticsearch (CouchDB, MongoDB, Hadoop)
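A sketch of the regularly run import/update option, using the bulk helper from elasticsearch-py; the index name and row shape are made up:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["http://localhost:9200"])

    def import_users(rows):
        # rows: an iterable of dicts fetched from the main datastore (e.g. PostgreSQL)
        actions = ({
            "_index": "reactor_users",
            "_type": "user",
            "_id": row["id"],
            "_source": row,
        } for row in rows)
        bulk(es, actions)  # one batched request instead of one request per document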
11. Elasticsearch pros
easy start
easy scaling (e.g. with the AWS plugin a new node can automatically join the cluster and be ready in a short time)
search capabilities
analytics with facets/aggregations (Kibana plugin)
easy backup with snapshots
easy to deploy - just one service
highly tweakable yet sane defaults
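For the snapshot point, a minimal sketch with elasticsearch-py; the repository name and path are made up, and the location must be reachable from every node:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # register a shared-filesystem snapshot repository, then snapshot the cluster
    es.snapshot.create_repository(repository="backups", body={
        "type": "fs",
        "settings": {"location": "/mnt/es_backups"},
    })
    es.snapshot.create(repository="backups", snapshot="snapshot_1",
                       wait_for_completion=True)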
12. Elasticsearch cons
only one type of relation between documents - parent/child and nested fields (mapping sketch below)
higher need for reindexing - repopulate data from scratch (change of format, new mappings for fields)
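To make the relation point concrete, a sketch of a parent/child mapping plus a nested field (1.x-era syntax; the index, types and fields are invented):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    es.indices.create(index="reactor_app-1", body={
        "mappings": {
            "user": {
                "properties": {
                    "fields": {"type": "nested"},  # nested: objects queried as a unit
                },
            },
            "event": {
                "_parent": {"type": "user"},  # parent/child: events belong to users
            },
        },
    })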
14. Clusters
Possible to use several clusters - one for data, one for monitoring
Clusters can talk to each other via tribe nodes - nodes from cluster1 send monitoring data (Marvel plugin) to a tribe node, which saves it to cluster2
15. Nodes
Use data and non-data nodes
Data nodes for storage
Non-data nodes for admin, local nodes on web server, etc.
Use tags
16. Indices
Split data into different indices (per client, segments, etc.) - reactor_app-id
Use aliases (migrations, time windows, etc.) - reactor_app-id as an alias for reactor_app-id_timestamp1 and reactor_app-id_timestamp2 (alias sketch below)
Possible to allocate bigger indices to more performant servers - node.tag and index.routing.allocation.include.tag
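A sketch of the alias trick with elasticsearch-py - the swap from the old timestamped index to the new one happens in a single atomic call, so clients keep querying reactor_app-id:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    es.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": "reactor_app-id_timestamp1", "alias": "reactor_app-id"}},
            {"add": {"index": "reactor_app-id_timestamp2", "alias": "reactor_app-id"}},
        ],
    })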
17. Shards
Use several shards (default 5)
Utilize more machines
Shard count is immutable for an existing index - plan your infrastructure
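Shard count (and the allocation tag from the previous slide) is set when the index is created; the numbers here are only illustrative:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # number_of_shards cannot be changed later; replicas and allocation rules can
    es.indices.create(index="reactor_app-id_timestamp2", body={
        "settings": {
            "number_of_shards": 10,
            "number_of_replicas": 1,
            "index.routing.allocation.include.tag": "big",  # keep it on node.tag: big machines
        },
    })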
18. Indexing
If possible, use batch indexing
Use the update script wisely - too big a batch of updates can slow down the application
Send indexing traffic to (local) non-data node
Play with mappings, use not_analyzed, fielddata
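A sketch of that mapping advice in 1.x-era syntax; the field names are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    es.indices.put_mapping(index="reactor_app-id", doc_type="user", body={
        "user": {
            "properties": {
                # not_analyzed: exact values for filters/facets, no analysis overhead
                "segment": {"type": "string", "index": "not_analyzed"},
                # keep fielddata on disk (doc_values) instead of the JVM heap
                "signup_date": {"type": "date", "doc_values": True},
            },
        },
    })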
19. Machines
Use local SSD, not network storage
Replicate, make snapshots
Benchmark - a few bigger machines can outperform many small machines (network delays, cluster management)
If in the cloud, use the Elasticsearch plugin for that cloud provider (cluster discovery, snapshots on cloud storage)