How to choose
a database
Vsevolod Solovyov
Why bother
 It's the most fundamental thing about the project
 Even programming languages are switched more often
 With the wrong choice you'll suffer. But why?
Different cases
 I certainly know what I need
 I'm not sure, something will do
 This tech is cool!
 Serious business: risk/benefit management
 Small project: ??
 Pet project: fun/learning management
I know what I need
 Huge time series DB
 Cross-datacenter replication
 Petabytes of data
 ...
 This talk is not for you
This tech is COOL
 Pet projects
 Experiments
 Beware otherwise
Consider this
 Data correctness (ACID, enforced schema)
 Easy data modeling
 Operational complexity
 Migrations
 Scaling
 Project use cases: no one knows them yet!
Data correctness
 Silently losing data is not much fun
 Schema-less is a lie
 Heterogeneous data is hard to analyze, change, and display
 Especially true in data-heavy projects
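A toy sketch of the "schema-less is a lie" point: with no enforced schema, the same logical field drifts in name and type over time, and naive analytics silently drop records (the field names here are made up for illustration):

```python
# Toy example: "schema-less" documents accumulate drift over time.
# The same logical field ends up under different names and types.
docs = [
    {"user": "alice", "age": 31},
    {"user": "bob", "Age": "27"},        # capitalized key, age stored as string
    {"username": "carol", "age": None},  # renamed key, missing value
]

# Naive analytics silently miscounts: only one document "has" an age.
ages = [d["age"] for d in docs if isinstance(d.get("age"), int)]
print(len(ages))  # 1 — two of three records silently dropped
```

An enforced schema would have rejected the second and third writes instead of losing them at read time.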
Data modeling
 In a document-oriented DB (e.g. MongoDB) you have to specially craft "tables" around anticipated queries, and re-craft them when the queries change!
 Much easier in an RDBMS: just dump it into adequate tables and slap some indexes on
 Datomic is best here: Entity-Attribute-Value
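A minimal sketch of the Entity-Attribute-Value idea Datomic builds on, using plain Python tuples (not Datomic's actual API — attribute names are invented for illustration):

```python
# Entity-Attribute-Value: every datum is a (entity, attribute, value) fact.
# New attributes need no schema migration — just assert new facts.
facts = [
    (1, "person/name", "Ada"),
    (1, "person/email", "ada@example.com"),
    (2, "person/name", "Grace"),
    (2, "person/title", "Rear Admiral"),  # an attribute entity 1 never uses
]

def attrs(entity_id):
    """Collect all attributes asserted for one entity."""
    return {a: v for e, a, v in facts if e == entity_id}

print(attrs(2))  # {'person/name': 'Grace', 'person/title': 'Rear Admiral'}
```

Because the "schema" is just the set of attributes in use, remodeling for a new query shape means asserting new facts, not rebuilding tables.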
Migrations
 Often overlooked part
 Keep them in repository!
 Transactional DDL
 Developing migrations in REPL
 Downgrade migrations are useless
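Transactional DDL means a half-applied migration rolls back cleanly instead of leaving the schema in limbo. A sketch using SQLite, which (like Postgres) has transactional DDL; `broken_fn` is a deliberately nonexistent function standing in for a buggy migration step:

```python
import sqlite3

# isolation_level=None: manual transaction control via explicit BEGIN/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

try:
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.execute("UPDATE users SET email = broken_fn(name)")  # no such function
    conn.execute("COMMIT")
except sqlite3.OperationalError:
    conn.execute("ROLLBACK")  # the ALTER TABLE is undone as well

cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name'] — no stray half-migrated column
```

On an engine without transactional DDL (MySQL, notably) the `email` column would survive the failure, and you'd have to clean it up by hand.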
Migration tools
 Native migrations (SQL, CQL, Datalog, etc.) are best
 Don't ever do auto-migrations
 Don't use tools that give you auto-migrations
 Use something like nomad or migrate
[Chart: effort vs. migration complexity, comparing ORM migration tools with native (SQL, etc.) ones; annotated with the tools' learning curve]
Complex migrations
 Create new column/table/database
 Read from the old place and write to both
 Migrate all old data to the new place
 Read from both places and compare
 Clean up old code and data
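The dual-write phase of the steps above can be sketched with dicts standing in for the old and new tables (all names here are hypothetical):

```python
# Dual-write phase of a zero-downtime migration.
# old_store / new_store are stand-ins for the old and new tables.
old_store, new_store = {}, {}

def write(key, value):
    old_store[key] = value  # old place stays authoritative...
    new_store[key] = value  # ...but every write also lands in the new one

def backfill():
    """Migrate all pre-existing data to the new place."""
    for key, value in old_store.items():
        new_store.setdefault(key, value)

def verify():
    """Read from both places and compare before cutting over."""
    return old_store == new_store

old_store["a"] = 1  # a row written before the migration started
write("b", 2)       # a row written during the dual-write phase
backfill()
print(verify())  # True — safe to switch reads and clean up the old path
```

Only once `verify()` holds for long enough do reads move to the new place and the old code and data get deleted.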
Performance
Data model matters here
Read the documentation!
Low-level details
 Pages
 Rows, columns, column families
 MVCC
 TOAST
 Index types: B-Tree, hash, GIN, GiST, BRIN
 Sharding: hash/range, sharding key
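The two sharding flavors in one toy sketch (a hypothetical 4-shard setup, not any particular DB's API):

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(key: str) -> int:
    """Hash sharding: even spread, but range scans must hit every shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(key: str) -> int:
    """Range sharding: neighbours stay together, but hot ranges skew load."""
    boundaries = ["g", "n", "t"]  # split points over the key's first letters
    for shard, bound in enumerate(boundaries):
        if key < bound:
            return shard
    return len(boundaries)

print(range_shard("alice"), range_shard("zoe"))  # 0 3
```

The choice of sharding key matters more than the scheme: shard on something your hottest queries filter by, or every query becomes a scatter-gather.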
Know your tools!
 Log slow queries (pg_stat, slowlog, etc)
 EXPLAIN
 EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
 htop, iotop, perf, etc
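The same habit works on any engine. A sketch with SQLite's EXPLAIN QUERY PLAN (Postgres's EXPLAIN ANALYZE gives far more detail, but the workflow — look at the plan, add an index, look again — is the same; table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cluster (id INTEGER, clustering_id INTEGER)")

# Without an index the planner must scan the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cluster WHERE clustering_id = 33"
).fetchall()
print(plan_before[0][-1])  # a full SCAN of the table

# After adding an index, the planner switches to an index search.
conn.execute("CREATE INDEX ix_clustering ON cluster (clustering_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cluster WHERE clustering_id = 33"
).fetchall()
print(plan_after[0][-1])  # a SEARCH using ix_clustering
```

The exact wording of the plan varies between SQLite versions; what matters is reading it before and after every index you add.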
ORMs considered
harmful
 They provoke massive data over-fetch
 Easy-to-miss 1+N queries
 Hard to refactor and move parts of data to
other DBs
 Very leaky abstraction
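The 1+N pattern made explicit, with a counter standing in for the round-trips an ORM's lazy loading hides (toy schema, sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO book VALUES (1, 1, 't1'), (2, 2, 't2'), (3, 3, 't3');
""")

queries = 0
def run(sql, *args):
    """Execute a statement while counting round-trips to the database."""
    global queries
    queries += 1
    return conn.execute(sql, args).fetchall()

# 1+N: one query for the list, then one lazily-loaded query per row.
for author_id, name in run("SELECT id, name FROM author"):
    run("SELECT title FROM book WHERE author_id = ?", author_id)
n_plus_one = queries

# The same data fetched with a single JOIN.
queries = 0
run("SELECT a.name, b.title FROM author a JOIN book b ON b.author_id = a.id")
print(n_plus_one, queries)  # 4 1
```

With 3 rows the difference is invisible; with 30,000 it's the difference between one round-trip and 30,001.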
Over-fetch
clustering = Clustering.query.get(33)
depth2cid = defaultdict(list)
for cl in clustering.clusters:
    depth2cid[cl.level].append(cl.id)
Over-fetch
clustering = Clustering.query.get(33)
depth2cid = defaultdict(list)
for cl in clustering.clusters:
    depth2cid[cl.level].append(cl.id)

11.4 GB RAM, 50 seconds
Proper-fetch
query = (
    db.session
    .query(Cluster.id, Cluster.level)
    .filter(Cluster.clustering_id == 33))
depth2cid = defaultdict(list)
for cid, level in query:
    depth2cid[level].append(cid)

54 MB RAM, 1 second
Properer-fetch
query = (
    db.session
    .query(Cluster.id, Cluster.level)
    .filter(Cluster.clustering_id == 33)
    .yield_per(1000))
depth2cid = defaultdict(list)
for cid, level in query:
    depth2cid[level].append(cid)

7 MB RAM, 1.2 seconds
Scale
Distributed FUD
 http://jepsen.io/
 CORDS: "Redundancy does not imply fault tolerance" (the morning paper)
 Do you really need it? RAM is plentiful
Buy bigger server
Scale
Buy bigger server
Scale
Well...
By that time you will know what you need
Horizontally scalable
 Citus
 VoltDB
 Cassandra
 CockroachDB
 ...
Pull out data bit-by-bit
Availability
 Services mostly die from other problems
 An untested "available" DB can be a problem itself
 A properly available (CAP-available) system is a pain and a resource sink
NoDB
 Images
 Machine learning models
 ...
 File system, S3, B2, etc
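A common shape for this: keep the blob on the file system (or S3, B2, ...) and store only its path or key in the database. A sketch with content-addressed local files and sqlite3 (all names hypothetical; a temp dir stands in for the blob store):

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

store = Path(tempfile.mkdtemp())  # stand-in for a blob store / S3 bucket
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT)")

def save_image(data: bytes) -> int:
    """Write the blob to disk; keep only its content-addressed path in the DB."""
    key = hashlib.sha256(data).hexdigest()
    path = store / key
    path.write_bytes(data)
    cur = db.execute("INSERT INTO images (path) VALUES (?)", (str(path),))
    return cur.lastrowid

def load_image(image_id: int) -> bytes:
    (path,) = db.execute(
        "SELECT path FROM images WHERE id = ?", (image_id,)).fetchone()
    return Path(path).read_bytes()

image_id = save_image(b"\x89PNG...fake bytes")
print(load_image(image_id) == b"\x89PNG...fake bytes")  # True
```

The DB stays small and fast, backups stay cheap, and the blobs live where bulk storage is actually good at holding them.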
At Cap'n Obvious
 Experiment at home
 Don't bring a new DB into a big project just because it's interesting
 Not sure? Postgres to the rescue
