How to choose
a database
Vsevolod Solovyov
Why bother
 It's the most fundamental thing about the project
 Even programming languages are switched more often
 With the wrong choice you'll suffer. But why?
Different cases
 I certainly know what I need
 I'm not sure, something will do
 This tech is cool!
 Serious business: risk/benefit management
 Small project: ??
 Pet project: fun/learning management
I know what I need
 Huge time series DB
 Cross-datacenter replication
 Petabytes of data
 ...
 This talk is not for you
This tech is COOL
 Pet projects
 Experiments
 Beware otherwise
Consider this
 Data correctness (ACID, enforced schema)
 Easy data modeling
 Operational complexity
 Migrations
 Scaling
 Project use cases: no one knows them yet!
Data correctness
 Silently losing data is not much fun
 Schema-less is a lie
 Heterogeneous data is hard to analyze, change, and display
 Especially true in data-heavy projects
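A toy sketch of the "schema-less is a lie" point: with no enforced schema, the same logical field drifts in name and type over time, and naive analytics silently drop records (the field names here are made up for illustration):

```python
# Toy example: "schema-less" documents accumulate drift over time.
# The same logical field ends up under different names and types.
docs = [
    {"user": "alice", "age": 31},
    {"user": "bob", "Age": "27"},        # capitalized key, age stored as string
    {"username": "carol", "age": None},  # renamed key, missing value
]

# Naive analytics silently miscounts: only one document "has" an age.
ages = [d["age"] for d in docs if isinstance(d.get("age"), int)]
print(len(ages))  # 1 — two of three records silently dropped
```

An enforced schema would have rejected the second and third writes instead of losing them at read time.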
Data modeling
 In a document-oriented DB (e.g. MongoDB) you have to specially craft "tables" around anticipated queries, and re-craft them when the queries change!
 Much easier in an RDBMS: just dump it into adequate tables and slap some indexes on
 Datomic is best here: Entity-Attribute-Value
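A minimal sketch of the Entity-Attribute-Value idea Datomic builds on, using plain Python tuples (not Datomic's actual API — attribute names are invented for illustration):

```python
# Entity-Attribute-Value: every datum is a (entity, attribute, value) fact.
# New attributes need no schema migration — just assert new facts.
facts = [
    (1, "person/name", "Ada"),
    (1, "person/email", "ada@example.com"),
    (2, "person/name", "Grace"),
    (2, "person/title", "Rear Admiral"),  # an attribute entity 1 never uses
]

def attrs(entity_id):
    """Collect all attributes asserted for one entity."""
    return {a: v for e, a, v in facts if e == entity_id}

print(attrs(2))  # {'person/name': 'Grace', 'person/title': 'Rear Admiral'}
```

Because the "schema" is just the set of attributes in use, remodeling for a new query shape means asserting new facts, not rebuilding tables.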
Migrations
 Often overlooked part
 Keep them in repository!
 Transactional DDL
 Developing migrations in REPL
 Downgrade migrations are useless
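Transactional DDL means a half-applied migration rolls back cleanly instead of leaving the schema in limbo. A sketch using SQLite, which (like Postgres) has transactional DDL; `broken_fn` is a deliberately nonexistent function standing in for a buggy migration step:

```python
import sqlite3

# isolation_level=None: manual transaction control via explicit BEGIN/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

try:
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.execute("UPDATE users SET email = broken_fn(name)")  # no such function
    conn.execute("COMMIT")
except sqlite3.OperationalError:
    conn.execute("ROLLBACK")  # the ALTER TABLE is undone as well

cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name'] — no stray half-migrated column
```

On an engine without transactional DDL (MySQL, notably) the `email` column would survive the failure, and you'd have to clean it up by hand.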
Migration tools
 Native migrations (SQL, CQL, Datalog, etc.) are best
 Don't ever do auto-migrations
 Don't use tools that give you auto-migrations
 Use something like nomad or migrate
[Chart: effort vs. migration complexity, comparing ORM migration tools with native (SQL, etc.) ones; annotated with the tools' learning curve]
Complex migrations
 Create new column/table/database
 Read from the old place and write to both
 Migrate all old data to the new place
 Read from both places and compare
 Clean up old code and data
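The dual-write phase of the steps above can be sketched with dicts standing in for the old and new tables (all names here are hypothetical):

```python
# Dual-write phase of a zero-downtime migration.
# old_store / new_store are stand-ins for the old and new tables.
old_store, new_store = {}, {}

def write(key, value):
    old_store[key] = value  # old place stays authoritative...
    new_store[key] = value  # ...but every write also lands in the new one

def backfill():
    """Migrate all pre-existing data to the new place."""
    for key, value in old_store.items():
        new_store.setdefault(key, value)

def verify():
    """Read from both places and compare before cutting over."""
    return old_store == new_store

old_store["a"] = 1  # a row written before the migration started
write("b", 2)       # a row written during the dual-write phase
backfill()
print(verify())  # True — safe to switch reads and clean up the old path
```

Only once `verify()` holds for long enough do reads move to the new place and the old code and data get deleted.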
Performance
Data model matters here
Read the documentation!
Low-level details
 Pages
 Rows, columns, column families
 MVCC
 TOAST
 Index types: B-Tree, hash, GIN, GiST, BRIN
 Sharding: hash/range, sharding key
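The two sharding flavors in one toy sketch (a hypothetical 4-shard setup, not any particular DB's API):

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(key: str) -> int:
    """Hash sharding: even spread, but range scans must hit every shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(key: str) -> int:
    """Range sharding: neighbours stay together, but hot ranges skew load."""
    boundaries = ["g", "n", "t"]  # split points over the key's first letters
    for shard, bound in enumerate(boundaries):
        if key < bound:
            return shard
    return len(boundaries)

print(range_shard("alice"), range_shard("zoe"))  # 0 3
```

The choice of sharding key matters more than the scheme: shard on something your hottest queries filter by, or every query becomes a scatter-gather.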
Know your tools!
 Log slow queries (pg_stat, slowlog, etc)
 EXPLAIN
 EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
 htop, iotop, perf, etc
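The same habit works on any engine. A sketch with SQLite's EXPLAIN QUERY PLAN (Postgres's EXPLAIN ANALYZE gives far more detail, but the workflow — look at the plan, add an index, look again — is the same; table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cluster (id INTEGER, clustering_id INTEGER)")

# Without an index the planner must scan the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cluster WHERE clustering_id = 33"
).fetchall()
print(plan_before[0][-1])  # a full SCAN of the table

# After adding an index, the planner switches to an index search.
conn.execute("CREATE INDEX ix_clustering ON cluster (clustering_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cluster WHERE clustering_id = 33"
).fetchall()
print(plan_after[0][-1])  # a SEARCH using ix_clustering
```

The exact wording of the plan varies between SQLite versions; what matters is reading it before and after every index you add.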
ORMs considered
harmful
 They provoke massive data over-fetch
 Easy-to-miss 1+N queries
 Hard to refactor and move parts of data to
other DBs
 Very leaky abstraction
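The 1+N pattern made explicit, with a counter standing in for the round-trips an ORM's lazy loading hides (toy schema, sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO book VALUES (1, 1, 't1'), (2, 2, 't2'), (3, 3, 't3');
""")

queries = 0
def run(sql, *args):
    """Execute a statement while counting round-trips to the database."""
    global queries
    queries += 1
    return conn.execute(sql, args).fetchall()

# 1+N: one query for the list, then one lazily-loaded query per row.
for author_id, name in run("SELECT id, name FROM author"):
    run("SELECT title FROM book WHERE author_id = ?", author_id)
n_plus_one = queries

# The same data fetched with a single JOIN.
queries = 0
run("SELECT a.name, b.title FROM author a JOIN book b ON b.author_id = a.id")
print(n_plus_one, queries)  # 4 1
```

With 3 rows the difference is invisible; with 30,000 it's the difference between one round-trip and 30,001.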
Over-fetch
clustering = Clustering.query.get(33)
depth2cid = defaultdict(list)
for cl in clustering.clusters:
    depth2cid[cl.level].append(cl.id)
Over-fetch
clustering = Clustering.query.get(33)
depth2cid = defaultdict(list)
for cl in clustering.clusters:
    depth2cid[cl.level].append(cl.id)

11.4 GB RAM, 50 seconds
Proper-fetch
query = (
    db.session
    .query(Cluster.id, Cluster.level)
    .filter(Cluster.clustering_id == 33))
depth2cid = defaultdict(list)
for cid, level in query:
    depth2cid[level].append(cid)

54 MB RAM, 1 second
Properer-fetch
query = (
    db.session
    .query(Cluster.id, Cluster.level)
    .filter(Cluster.clustering_id == 33)
    .yield_per(1000))
depth2cid = defaultdict(list)
for cid, level in query:
    depth2cid[level].append(cid)

7 MB RAM, 1.2 seconds
Scale
Distributed FUD
 http://jepsen.io/
 CORDS: "Redundancy does not imply fault tolerance" (the morning paper)
 Do you really need it? RAM is plentiful
Buy bigger server
Scale
Buy bigger server
Scale
Well...
By that time you will know what you need
Horizontally scalable
 Citus
 VoltDB
 Cassandra
 CockroachDB
 ...
Pull out data bit-by-bit
Availability
 Services mostly die from other problems
 An untested "available" DB can be a problem itself
 A properly available (CAP-available) system is a pain and a resource sink
NoDB
 Images
 Machine learning models
 ...
 File system, S3, B2, etc
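A common shape for this: keep the blob on the file system (or S3, B2, ...) and store only its path or key in the database. A sketch with content-addressed local files and sqlite3 (all names hypothetical; a temp dir stands in for the blob store):

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

store = Path(tempfile.mkdtemp())  # stand-in for a blob store / S3 bucket
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT)")

def save_image(data: bytes) -> int:
    """Write the blob to disk; keep only its content-addressed path in the DB."""
    key = hashlib.sha256(data).hexdigest()
    path = store / key
    path.write_bytes(data)
    cur = db.execute("INSERT INTO images (path) VALUES (?)", (str(path),))
    return cur.lastrowid

def load_image(image_id: int) -> bytes:
    (path,) = db.execute(
        "SELECT path FROM images WHERE id = ?", (image_id,)).fetchone()
    return Path(path).read_bytes()

image_id = save_image(b"\x89PNG...fake bytes")
print(load_image(image_id) == b"\x89PNG...fake bytes")  # True
```

The DB stays small and fast, backups stay cheap, and the blobs live where bulk storage is actually good at holding them.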
At Cap'n Obvious
 Experiment at home
 Don't bring a new DB into a big project just because it's interesting
 Not sure? Postgres to the rescue
