Vinyl: why we wrote our own write-optimized storage engine rather than choose RocksDB
Konstantin Osipov
kostja@tarantool.org
Plan
• A review of the log-structured merge tree data structure
• The Vinyl engine in Tarantool
• Configuration parameters
• Key use cases
Why build, rather than buy: RocksDB, ForestDB, ...
[Architecture diagram: Tarantool's pluggable storage engine design. Shared layers: I/O (session, parser), transaction control (read your own writes, multi-engine, undo log), and the write-ahead log (redo log, async relay, group replication). Each engine (Engine 1, Engine 2, ...) implements its own indexing, checkpointing, and storage management.]
The shape of a classical LSM
DELETE in an LSM tree
• We can't delete from append-only files
• Tombstones (delete markers) are inserted into L0 instead (see the merge sketch after the diagrams below)
[Diagram: DELETE 26. L0 in memory holds 22 and 37; the disk runs hold 10 25 36 42 and 3 8 15 26 35 40 45 48, so key 26 lives only on disk.]
DELETE in an LSM tree
[Diagram: a tombstone for 26 is inserted into L0, which now holds 22, 26, 37; the disk runs are unchanged.]
DELETE in an LSM tree
[Diagram: after dump and compaction the disk holds a single run 3 8 10 15 22 25 35 36 37 40 42 45 48; key 26 and its tombstone are both gone, and L0 is empty.]
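To make the tombstone rule concrete, here is a minimal, engine-agnostic Lua sketch of a merge step that honours delete markers. The statement shape {key, lsn, op, value} and the merge_runs helper are illustrative assumptions, not Vinyl's actual internals, and the sketch presumes a major merge, so tombstones may be dropped for good.

-- Minimal sketch: merging two sorted runs while honouring tombstones.
-- Each statement is {key = ..., lsn = ..., op = "REPLACE"|"DELETE", value = ...}.
-- Assumes the newer input shadows the older one, and that this is a major
-- merge, so tombstones are not rewritten.
local function merge_runs(newer, older)
    local out = {}
    local seen = {}                       -- keys already decided by a newer statement
    local function emit(stmt)
        if seen[stmt.key] then
            return                        -- an older duplicate: shadowed, prune it
        end
        seen[stmt.key] = true
        if stmt.op ~= "DELETE" then
            table.insert(out, stmt)       -- tombstones are dropped on a major merge
        end
    end
    for _, stmt in ipairs(newer) do emit(stmt) end
    for _, stmt in ipairs(older) do emit(stmt) end
    table.sort(out, function(a, b) return a.key < b.key end)
    return out
end

-- DELETE 26 in memory shadows REPLACE 26 on disk; neither survives the merge.
local l0   = { {key = 26, op = "DELETE",  lsn = 105} }
local disk = { {key = 15, op = "REPLACE", lsn = 90, value = "a"},
               {key = 26, op = "REPLACE", lsn = 80, value = "b"} }
for _, s in ipairs(merge_runs(l0, disk)) do print(s.key, s.op) end   -- only 15 remains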
SELECT in LSM tree
• Search the levels, newest to oldest, until the key is found
• Optimistic for point queries
• A merge sort in case of a range select (see the lookup sketch below)
SELECT in LSM tree
[Diagram: GET(16) probes L0 (22, 37) and each disk run in turn until the key is found in the deepest run (2 6 7 10 14 16 23 28 30 32 37 38 41 45 47 49).]
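A point lookup can be sketched the same way: probe the levels from newest to oldest and stop at the first statement for the key, treating a tombstone as "not found". The levels table and get() helper below are illustrative, not Vinyl's API.

-- Minimal sketch of GET(key): probe levels newest-to-oldest, stop at the first hit.
-- Each level maps key -> {op = "REPLACE"|"DELETE", value = ...}.
local function get(levels, key)
    for _, level in ipairs(levels) do     -- levels[1] is the memtable (newest)
        local stmt = level[key]
        if stmt ~= nil then
            if stmt.op == "DELETE" then
                return nil                -- tombstone shadows older versions
            end
            return stmt.value
        end
    end
    return nil                            -- key never existed
end

local memtable = { [22] = {op = "REPLACE", value = "v22"} }
local run1     = { [16] = {op = "REPLACE", value = "v16"},
                   [25] = {op = "DELETE"} }
local run2     = { [25] = {op = "REPLACE", value = "old"} }
print(get({memtable, run1, run2}, 16))    -- "v16": found in the first disk run
print(get({memtable, run1, run2}, 25))    -- nil: the tombstone wins over the older run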
LSM: the algorithmic limit
          LSM tree              B-tree
Search    K * O(log2(N)/B)      O(logB(N))
Delete    O(log2(N)/B)          O(logB(N))
Insert    O(log2(N)/B)          O(logB(N))
RUM conjecture
The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems
Key LSM challenges in Web/OLTP
• Slow reads: read amplification
• The same data is potentially written multiple times: write amplification
• Garbage is kept around: space amplification
• Response times are affected by background activity: latency spikes
Vinyl: memtable, sorted run, dump & compact
• Stores statements, not values: REPLACE, DELETE, UPSERT
• Every statement is marked with an LSN
• Append-only files, garbage-collected after a checkpoint
• A transactional log of all filesystem changes: vylog
• Statement layout: key | lsn | op_code | value (see the sketch below)
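A hedged sketch of that statement layout in Lua, with an illustrative newest() helper showing that, among several statements for one key, the one with the highest LSN is the visible version:

-- Sketch of the statement shape: key, lsn, op_code, value.
-- The helpers are illustrative, not Vinyl's internal representation.
local function statement(key, lsn, op, value)
    return {key = key, lsn = lsn, op = op, value = value}
end

local function newest(stmts)
    local best
    for _, s in ipairs(stmts) do
        if best == nil or s.lsn > best.lsn then best = s end
    end
    return best
end

local history = {
    statement(42, 100, "REPLACE", "first"),
    statement(42, 120, "UPSERT",  {{"=", 2, "second"}}),
    statement(42, 130, "DELETE"),
}
print(newest(history).op)   -- "DELETE": the key is currently invisible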
Memtable, sorted run, dump & compact
[Diagram: the memtable (22, 37) in memory is dumped to disk as a new sorted run; the existing runs 10 25 36 40 and 3 8 15 26 36 40 45 are compacted into a single run 3 8 10 15 25 26 36 40 45.]
Vinyl: read [25, 30)
[Diagram: a range read of [25, 30) merge-sorts the sources from newest to oldest: uncommitted changes, the tuple cache, L0, and the sorted runs; the merged result is 25 26 29.]
Keeping the latency within limits
• anticipatory dump (see the rotation sketch below)
• throttling
[Diagram: the filled memtable (22, 37) becomes a read-only shadow copy that is dumped into a sorted run on disk, while a fresh memtable keeps accepting writes.]
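One way to read the picture above: when the active memtable nears its quota it is rotated into a read-only shadow copy that a background fiber dumps to disk, while new writes go into a fresh memtable. The Lua sketch below shows that rotation rule; the quota, watermark, and names are illustrative assumptions, not Vinyl's actual thresholds.

-- Sketch of anticipatory dump: rotate the memtable before memory runs out,
-- so writes never have to stop and wait for a full dump.
local QUOTA     = 256 * 1024 * 1024   -- illustrative L0 budget, bytes
local WATERMARK = 0.8                 -- start dumping at 80% of the quota

local state = {
    active = {size = 0, data = {}},   -- accepts new writes
    shadow = nil,                     -- read-only, being dumped to disk
}

local function maybe_rotate()
    if state.shadow == nil and state.active.size >= QUOTA * WATERMARK then
        state.shadow = state.active              -- freeze the current memtable
        state.active = {size = 0, data = {}}     -- new writes go here
        -- a background fiber would now dump state.shadow into a sorted run
    end
end

local function write(key, value, bytes)
    state.active.data[key] = value
    state.active.size = state.active.size + bytes
    maybe_rotate()
end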
Reducing read amplification
• page index
• bloom filters (see the sketch below)
• tuple range cache
• multi-level compaction
[Diagram: writes land in L0 (22, 37); each sorted run on disk (3 8 10 15 25 26 36 40 45) carries a page index (3 15 22 36) and a per-run bloom filter; hot tuples (22 25 26) are served from the tuple cache without touching disk.]
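The bloom filter is what lets a read skip a sorted run entirely: a lookup may see a false positive but never a false negative, so a negative answer proves the key was never written to that run. Below is a toy Lua filter using two seeded variants of one simple hash, purely to illustrate the check; it is not Vinyl's implementation.

-- Toy bloom filter: each insert sets two bits; a lookup that finds either
-- bit clear proves the key is absent from the run, so the run is skipped.
local NBITS = 1024

local function hash(s, seed)
    local h = seed
    for i = 1, #s do
        h = (h * 31 + s:byte(i)) % NBITS
    end
    return h + 1            -- 1-based index into the bit table
end

local function new_filter()
    return {bits = {}}
end

local function add(f, key)
    local s = tostring(key)
    f.bits[hash(s, 7)]   = true
    f.bits[hash(s, 131)] = true
end

local function may_contain(f, key)
    local s = tostring(key)
    return f.bits[hash(s, 7)] == true and f.bits[hash(s, 131)] == true
end

local f = new_filter()
add(f, 22); add(f, 37)
print(may_contain(f, 22))   -- true
print(may_contain(f, 99))   -- almost certainly false: the run is skipped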
Reducing write amplification
• Multi-level compaction can span any number of levels
• A level can contain multiple runs
[Diagram: writes flow down through several levels of sorted runs of increasing size; a level may hold more than one run, and a single compaction may merge runs from several levels at once.]
Reducing space amplification
• Ranges reflect a static layout of sorted runs
• Slices tie parts of a sorted run to individual ranges
[Diagram: the key space is split into ranges [-inf, 23), [23, 45), [45, +inf) held in memory; each range references slices of the on-disk sorted runs that overlap it.]
Secondary key support
• L0 and the tuple cache use memtx technology
• This makes it possible to include L0 and the tuple cache in a checkpoint, so there is no cold restart
• L1+ stores the primary-key value as the tuple id
• Unique secondary keys rely heavily on bloom filters for INSERT and REPLACE
• REPLACE into a non-unique secondary key is a blind write; garbage is pruned at compaction (see the sketch below)
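In application terms the slide corresponds to defining a vinyl space with a unique primary index and a non-unique secondary index. A minimal sketch, assuming the Tarantool 1.7+ Lua schema API and illustrative space and field names:

-- Minimal sketch: a vinyl space with a unique primary key and a
-- non-unique secondary key. Space and field names are illustrative.
box.cfg{}   -- assumes a freshly bootstrapped instance with default options

local s = box.schema.space.create('accounts', {engine = 'vinyl', if_not_exists = true})

-- Primary index: L1+ runs store the full tuple here.
s:create_index('pk', {parts = {1, 'unsigned'}, if_not_exists = true})

-- Secondary index: stores the primary-key value as a tuple id;
-- REPLACE into it is a blind write, stale entries are pruned at compaction.
s:create_index('by_owner', {
    parts = {2, 'string'},
    unique = false,
    if_not_exists = true,
})

s:replace{1, 'alice', 100}
s:replace{2, 'bob',   250}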
The scheduler
• The only active entity inside the vinyl engine
• Maintains two queues: dump and compact
• Each range is in both queues, enabling parallel compaction
• Dump always trumps compact (see the sketch below)
• Chunked garbage collection for quick memory reclamation
• Runs as a fiber, so it needs no locks
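A hedged sketch of the "dump always trumps compact" rule: the scheduler holds a dump queue and a compact queue and hands out compaction work only when nothing is waiting to be dumped. The queue names and next_task() helper are illustrative, not Vinyl's code.

-- Sketch of the scheduler's task selection: dump tasks always win,
-- compaction runs only when no dump is pending.
local dump_queue    = {}
local compact_queue = {}

local function next_task()
    if #dump_queue > 0 then
        return table.remove(dump_queue, 1)     -- frees L0 memory fastest
    end
    if #compact_queue > 0 then
        return table.remove(compact_queue, 1)
    end
    return nil                                 -- nothing to do, sleep
end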
Transactions
• MVCC
• The first transaction to commit wins
• No waits, no deadlocks, only aborts (see the retry sketch below)
• Yields don't abort transactions
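Because a conflicting transaction is aborted at commit rather than made to wait, callers typically wrap it in a retry loop. A minimal sketch, assuming the standard box.begin()/box.commit() Lua API and an illustrative space s holding {id, balance} tuples:

-- Sketch of retrying an optimistic transaction: the first committer wins,
-- a conflicting transaction gets an error and is simply retried.
local function transfer(s, from, to, amount)
    for _ = 1, 5 do
        local ok = pcall(function()
            box.begin()
            s:update(from, {{'-', 2, amount}})
            s:update(to,   {{'+', 2, amount}})
            box.commit()
        end)
        if ok then
            return true
        end
        pcall(box.rollback)    -- clean up the aborted transaction, then retry
    end
    return false, 'too many conflicts'
end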
Replication & backup
Work out of the box
Limitations
• Tuple size is limited to 16 MB
• You need ~5k file descriptors per 1 TB of data
• Optimistic transaction control is a poor fit for long read-write transactions
• No cross-engine transactions yet
Configuration
Name                 Description                                     Default
bloom_fpr            Bloom filter max false-positive rate            5%
page_size            Approximate page size                           8192
range_size           Approximate range size                          1G
run_count_per_level  How many sorted runs are allowed on one level   2
run_size_ratio       Run size ratio between levels                   3.5
memory               L0 total size                                   0.25G
cache                Tuple cache size                                0.25G
threads              The number of compaction threads                2
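For reference, these knobs map onto Tarantool's Lua configuration. A hedged sketch, assuming the vinyl_-prefixed box.cfg options of Tarantool 1.7+; in some versions page_size, range_size, run_count_per_level, run_size_ratio and bloom_fpr are passed per index to create_index instead, so check your version's documentation.

-- Illustrative vinyl tuning; option names assume the vinyl_-prefixed
-- box.cfg parameters.
box.cfg{
    vinyl_memory              = 256 * 1024 * 1024,   -- L0 total size
    vinyl_cache               = 256 * 1024 * 1024,   -- tuple cache size
    vinyl_page_size           = 8192,                -- approximate page size
    vinyl_range_size          = 1024 * 1024 * 1024,  -- approximate range size
    vinyl_run_count_per_level = 2,
    vinyl_run_size_ratio      = 3.5,
    vinyl_bloom_fpr           = 0.05,                -- max false-positive rate
}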
UPSERT
• A non-reading update or insert
• Delayed execution
• Background upsert squashing prevents upserts from piling up (see the example below)
space:upsert(tuple, {{operator, field, value}, ...})
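A concrete example of the call above, with an illustrative space s holding {id, counter} tuples: the tuple is inserted if the key is absent, otherwise the update operations are applied, and no read happens on the write path.

-- 's' is an illustrative vinyl space with tuples {id, counter}.
-- If key 42 is absent, insert {42, 1}; otherwise add 1 to field 2.
s:upsert({42, 1}, {{'+', 2, 1}})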
@kostja_osipov fb.com/TarantoolDatabase
www.tarantool.org  kostja@tarantool.org
Thank you! Questions?
Links
Leveled compaction in Apache Cassandra
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
http://www.scylladb.com/kb/compaction/
https://github.com/facebook/rocksdb/wiki/Universal-Compaction
https://dom.as/2015/04/09/how-innodb-lost-its-advantage/