Elasticsearch as (only) datastore
How to integrate it into an existing project (and leave everything else in place)
Tomas Sirny
@junckritter
www.reactor.am
Me
• Python developer
• DevOps role
• using/learning Elasticsearch for 3 years (since version 0.16)
• Synopsi.tv, Reactor
Synopsi.tv
• movie recommendation service
• database of movies and TV shows
• needed search - Elasticsearch
• search box on every page (prefix search)
• advanced search (search + facets)
• PostgreSQL as the main datastore
• import to ES with a script, plus hooks on add/update
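As a rough illustration of the add/update hook approach, a sketch using the elasticsearch-py client; the Movie object and the index/field names are made up for the example, not taken from the Synopsi.tv code:

    # hook called by the application whenever a movie is added or updated
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    def on_movie_saved(movie):
        es.index(
            index="movies",
            doc_type="movie",        # mapping types, as used in the 0.x/1.x era
            id=movie.id,
            body={"title": movie.title, "year": movie.year},
        )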
Synopsi.tv - lessons
• good for search
• mappings are powerful
• reindexing needed on format/mapping changes
• sometimes missing documents (Bruce Willis) - in the index but not searchable
• probably not yet suitable as the only datastore
Reactor
• service for communication with the users of your application
• send data about users and their activity (events)
• filter users, define segments
• set rules for reactions - email, webhook, SMS, etc.
Data structure
• small and simple pieces of data - JSON
• simple relations (application <-> user <-> field)
• NoSQL is a better fit, but doable in e.g. PostgreSQL
Backends - theory
• the application saves data in several forms:
  • raw (as it comes in)
  • cleaned and sanitized
  • formatted for a specific datastore
• the save method iterates over the configured set of datastore backends and sends the same data to each of them (see the sketch after this list)
• different backends for different operations - get, filter, analytics
• slight duplication of functionality in the application
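A minimal sketch of that save loop, with made-up backend classes and a trivial sanitize step (none of these names come from the actual Reactor code):

    def sanitize(raw_event):
        # placeholder clean-up: drop empty values
        return {k: v for k, v in raw_event.items() if v is not None}

    class ESBackend(object):
        def format(self, event):
            return event                         # reshape for Elasticsearch here

        def save(self, doc):
            print("indexing into ES:", doc)      # would call es.index(...) in practice

    class EventStore(object):
        def __init__(self, backends):
            self.backends = backends             # configured per environment

        def save(self, raw_event):
            cleaned = sanitize(raw_event)
            for backend in self.backends:
                backend.save(backend.format(cleaned))

    EventStore([ESBackend()]).save({"user": "42", "event": "login", "ip": None})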
Backends - practice
• Mongo/DynamoDB - raw data, cold storage
• PostgreSQL - working data
• Elasticsearch - added later, working data, analytics
• different sets in different environments (devel machine, production)
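One way to express the per-environment sets is a plain configuration dict; the backend names here are illustrative:

    BACKENDS = {
        "development": ["postgresql", "elasticsearch"],
        "production":  ["dynamodb", "postgresql", "elasticsearch"],
    }

    def backends_for(environment):
        # the application builds its store from this list at start-up
        return BACKENDS[environment]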
Source of truth
• one backend is trusted by definition - the cold storage, needed to repopulate data in the other backends (see the sketch below)
• it should be simple (hard to break) and scalable (probably in the cloud)
• possible forms:
  • JSON files, Hadoop, etc.
  • NoSQL database (Mongo, Cassandra, DynamoDB)
  • Elasticsearch - in a different format (indices, nodes) than the working data
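Repopulating a working backend from the trusted cold storage can then be a simple replay loop; iter_raw_events() and the backend objects are hypothetical and reuse the shape of the earlier save-loop sketch:

    def repopulate(cold_storage, target_backend):
        # replay every raw event from the source of truth through the normal save path
        for raw_event in cold_storage.iter_raw_events():
            cleaned = sanitize(raw_event)
            target_backend.save(target_backend.format(cleaned))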
Input to Elasticsearch
• a regularly run import/update script - if you do not need (almost) live data (sketched below)
• logic in the application (our case)
• a river - an input channel from another source into Elasticsearch (CouchDB, MongoDB, Hadoop)
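For the first option, a sketch of a periodically run import script using the bulk helper from elasticsearch-py; fetch_changed_since() and the index name are made up for the example:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def fetch_changed_since(since):
        # stand-in for e.g. a SELECT ... WHERE updated_at > since on the main datastore
        return [{"id": 1, "email": "user@example.com", "updated_at": since}]

    def import_changes(since):
        actions = (
            {"_index": "reactor_users", "_type": "user", "_id": row["id"], "_source": row}
            for row in fetch_changed_since(since)
        )
        bulk(es, actions)      # one bulk request instead of many single index calls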
Elasticsearch pros
• easy start
• easy scaling (e.g. with the AWS plugin a new node can automatically join the cluster and be ready in a short time)
• search capabilities
• analytics with facets/aggregations (Kibana plugin)
• easy backup with snapshots (see below)
• easy to deploy - just one service
• highly tweakable, yet sane defaults
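For example, snapshot-based backups are two API calls; the repository name and filesystem path here are illustrative:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # register a shared-filesystem repository once...
    es.snapshot.create_repository(
        repository="backups",
        body={"type": "fs", "settings": {"location": "/mnt/es-backups"}},
    )
    # ...then take snapshots as often as needed
    es.snapshot.create(repository="backups", snapshot="snapshot_1")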
Elasticsearch cons
• only one type of relation between documents - parent/child (and nested objects); see the sketch below
• higher need for reindexing - repopulating data from scratch (change of format, new mappings for fields)
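For reference, a sketch of the parent/child relation as it looked in the 1.x-era API; the type and field names are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(
        index="reactor",
        body={
            "mappings": {
                "app":  {},
                "user": {"_parent": {"type": "app"}},   # each user document points to an app
            }
        },
    )

    es.index(index="reactor", doc_type="app", id="1", body={"name": "demo"})
    es.index(index="reactor", doc_type="user", id="42", parent="1", body={"email": "user@example.com"})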
Lessons learned
Clusters
• it is possible to use several clusters - one for data, one for monitoring
• they can talk to each other via tribe nodes - nodes from cluster1 send monitoring data (Marvel plugin) to a tribe node, which saves it to cluster2
Nodes
• use data and non-data nodes
• data nodes for storage
• non-data nodes for admin tasks, local client nodes on web servers, etc.
• use tags
Indices
• split data into different indices (per client, per segment, etc.) - reactor_app-id
• use aliases (migrations, time windows, etc.) - reactor_app-id as an alias for reactor_app-id_timestamp1 and reactor_app-id_timestamp2 (see the sketch below)
• bigger indices can be allocated to more powerful servers - node.tag and index.routing.allocation.include.tag
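A sketch of the alias switch and tag-based allocation with elasticsearch-py; the index names and the "big" tag are illustrative:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # new timestamped index, allocated only to nodes started with node.tag: big
    es.indices.create(
        index="reactor_app-42_20140601",
        body={"settings": {"index.routing.allocation.include.tag": "big"}},
    )

    # atomically move the alias from the old index to the new one
    es.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": "reactor_app-42_20140501", "alias": "reactor_app-42"}},
            {"add":    {"index": "reactor_app-42_20140601", "alias": "reactor_app-42"}},
        ]
    })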
Shards
• use several shards (the default is 5)
• utilize more machines
• the shard count is immutable for an existing index - plan your infrastructure
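Since the shard count cannot be changed later, it has to be set when the index is created; the values here are illustrative:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(
        index="reactor_app-43",
        body={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
    )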
Indexing
• if possible, use bulk indexing
• use update scripts wisely - too big a batch of updates can slow down the application
• send indexing traffic to a (local) non-data node
• play with the mapping; use not_analyzed and fielddata settings (see the sketch below)
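A sketch of an explicit mapping with a not_analyzed field, created through a local client node; the index and field names are made up and the syntax is the 1.x-era one:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])   # e.g. a local non-data node on the web server

    es.indices.create(
        index="reactor_app-44",
        body={
            "mappings": {
                "user": {
                    "properties": {
                        "email": {"type": "string", "index": "not_analyzed"},  # exact match, no analysis
                        "name":  {"type": "string"},
                    }
                }
            }
        },
    )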
Machines
• use local SSDs, not network storage
• replicate, make snapshots
• benchmark - a few bigger machines can outperform many small ones (network delays, cluster management overhead)
• if in the cloud, use the Elasticsearch plugins for that cloud provider (cluster discovery, snapshots to cloud storage)
Thanks.
