This is my talk on Vinyl storage engine internals at Percona Live Santa Clara 2017. Vinyl is a write-optimized storage engine based on the log-structured merge (LSM) tree data structure, shipped in Tarantool 1.7.
1. Vinyl: why we wrote our own write-optimized storage engine rather than choose RocksDB
Konstantin Osipov, kostja@tarantool.org
2. Plan
A review of the log-structured merge (LSM) tree data structure
Vinyl engine in Tarantool
Configuration parameters
Key use cases
3. Why build rather than buy: RocksDB, ForestDB, ...
[Architecture diagram: subsystems shared by every engine in the Tarantool server]
Transaction control: read your own writes, multi-engine, undo log
Write-ahead log: redo log, async relay, group replication
I/O: session, parser
Engine 1, Engine 2, ...: indexing, checkpoint, storage management
An engine only supplies indexing, checkpointing and storage management; the snippet below shows how a vinyl space plugs into the shared layers.
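A minimal Tarantool Lua sketch of that split (space and index names are just examples): the schema and data-manipulation calls are the same for memtx and vinyl, only the engine option differs.

    -- minimal sketch: the engine is chosen per space, the layers above it are shared
    box.cfg{}                                          -- start the instance with defaults
    local s = box.schema.space.create('kv', {engine = 'vinyl'})
    s:create_index('pk', {parts = {1, 'unsigned'}})    -- primary index on the first field
    s:replace{1, 'hello'}                              -- goes through the shared WAL and transaction layers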
5. DELETE in an LSM tree
We can't delete from append-only files
Tombstones (delete markers) are inserted into L0 instead
[Diagram: DELETE of key 26. Sorted runs on disk hold 3 8 15 26 35 40 45 48 and 10 25 36 42; the in-memory level holds 22 37.]
6. DELETE in an LSM tree
[Diagram: a tombstone for key 26 is inserted into the in-memory level, which now holds 22 26 37; the disk runs are unchanged.]
7. DELETE in an LSM tree
[Diagram: after dump and compaction the tombstone and key 26 are gone; a single merged run on disk holds 3 8 10 15 22 25 35 36 37 40 42 45 48 and the in-memory level is empty.]
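A conceptual Lua sketch of that compaction step (not Vinyl's actual code): merge the statements of several runs, keep only the newest statement per key, and drop the key entirely if that statement is a tombstone. In the real engine a tombstone can only be discarded by a compaction that covers all older runs which may still contain the key.

    -- conceptual sketch: merge runs and garbage-collect tombstones
    local function compact(runs)
        local newest = {}                      -- key -> statement with the highest lsn
        for _, run in ipairs(runs) do
            for _, stmt in ipairs(run) do      -- stmt = {key = ..., lsn = ..., op = ..., value = ...}
                local cur = newest[stmt.key]
                if cur == nil or stmt.lsn > cur.lsn then
                    newest[stmt.key] = stmt
                end
            end
        end
        local result = {}
        for _, stmt in pairs(newest) do
            if stmt.op ~= 'DELETE' then        -- the tombstone and the key it hides disappear here
                table.insert(result, stmt)
            end
        end
        table.sort(result, function(a, b) return a.key < b.key end)
        return result                          -- the new, smaller sorted run
    end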
8. SELECT in LSM tree
Search level after level until the key is found
Optimistic for point queries: the newest levels are checked first
Merge sort across levels for a range select (sketched below)
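A hedged Lua sketch of the point-lookup path, assuming levels are ordered newest to oldest; level:find() is a hypothetical helper, and the real engine also consults Bloom filters and the tuple cache before touching a run on disk.

    -- conceptual sketch: probe levels from newest to oldest,
    -- the first statement found for the key wins
    local function lsm_get(levels, key)
        for _, level in ipairs(levels) do      -- levels[1] is L0 in memory, then the disk runs
            local stmt = level:find(key)       -- hypothetical per-level lookup
            if stmt ~= nil then
                if stmt.op == 'DELETE' then
                    return nil                 -- a tombstone shadows any older version
                end
                return stmt.value
            end
        end
        return nil                             -- the key is on no level
    end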
10. LSM: the algorithmic limit
Operation: LSM tree vs. B-tree (N keys, block size B, K sorted runs)
Search: K * O(log2(N/B)) vs. O(logB(N))
Delete: O(log2(N)/B) vs. O(logB(N))
Insert: O(log2(N)/B) vs. O(logB(N))
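For illustration, take N = 2^30 keys, B = 2^10 keys per block and K = 5 runs (example figures, not from the talk):

    B-tree search:           logB(N) = 30/10 = 3 block reads
    LSM insert, amortized:   log2(N)/B = 30/1024 ≈ 0.03 block writes per statement
    LSM search, worst case:  K * log2(N/B) = 5 * 20 = 100 block reads,
                             cut down in practice by Bloom filters and the tuple cache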
11. RUM conjecture
The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems
12. Key LSM challenges in Web/OLTP
Slow reads: read amplification
Potentially write the same data multiple times: write amplification
Keep garbage around: space amplification
Response times affected by background activity: latency spikes
13. Vinyl: memtable, sorted run, dump & compact
Stores statements, not values: REPLACE, DELETE, UPSERT
Every statement is marked with an LSN
Append-only files, garbage-collected after a checkpoint
Transactional log of all filesystem changes: vylog
Statement layout: key | lsn | op_code | value
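A hypothetical Lua view of one such statement, following the layout above (the field names are invented for illustration):

    -- illustration only: how one statement in a run can be pictured
    local stmt = {
        key     = {42},          -- indexed key parts
        lsn     = 1001,          -- log sequence number assigned at commit
        op_code = 'REPLACE',     -- REPLACE, DELETE or UPSERT
        value   = {42, 'abc'},   -- the tuple for REPLACE/UPSERT; a DELETE carries no value
    }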
19. Reducing space amplification
Ranges reflect a static layout of sorted runs
Slices connect sorted runs to ranges
[Diagram: the key space is split into the ranges [-inf, 23), [23, 45), [45, +inf); the sorted runs on disk (65 92, 2 3 9 14 15 22, 14 15 35 89, 23 27 28 31 33 37 44 81 93 94 96 98) are sliced along these boundaries, with memory (L0) above them.]
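A hypothetical Lua sketch of the idea (all names invented): each range owns slices that point into shared sorted runs, so large runs can be compacted one range at a time.

    -- illustration only: ranges own slices, slices reference shared runs;
    -- first = nil / last = nil stand for -inf / +inf
    local ranges = {
        {first = nil, last = 23,  slices = {{run = 'run_1', keys = {2, 22}}}},
        {first = 23,  last = 45,  slices = {{run = 'run_3', keys = {23, 44}}}},
        {first = 45,  last = nil, slices = {{run = 'run_3', keys = {81, 98}},
                                            {run = 'run_4', keys = {65, 92}}}},
    }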
20. Secondary key support
L0 and tuple cache use memtx technology
This makes it possible to include L0 and the tuple cache in a checkpoint, so no cold restart
L1+ stores the primary key value as the tuple id
Unique secondary keys rely heavily on Bloom filters for INSERT and REPLACE
REPLACE in a non-unique secondary index is a blind write; garbage is pruned at compaction
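For context, a Tarantool Lua sketch of a vinyl space with a non-unique secondary index (names are examples only):

    -- primary key on field 1, non-unique secondary key on field 2
    local orders = box.schema.space.create('orders', {engine = 'vinyl'})
    orders:create_index('pk',        {parts = {1, 'unsigned'}})
    orders:create_index('by_status', {parts = {2, 'string'}, unique = false})
    orders:replace{1, 'new'}   -- the secondary index gets a blind write, no read before it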
21. The scheduler
The only active entity inside the vinyl engine
Maintains two queues: compact and dump
Each range is in each queue, enabling parallel compaction
Dump always trumps compact
Chunked garbage collection for quick memory reclamation
It is a fiber, so it needs no locks
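A conceptual Lua sketch of the priority rule only (this is not the engine's scheduler code): dump requests are always served before compaction requests.

    -- conceptual sketch: dump always trumps compact
    local function scheduler_step(dump_queue, compact_queue)
        if #dump_queue > 0 then
            return table.remove(dump_queue, 1)     -- free L0 memory first
        elseif #compact_queue > 0 then
            return table.remove(compact_queue, 1)  -- then reduce read and space amplification
        end
        return nil                                 -- nothing to do, the fiber can sleep
    end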
22. Transactions
MVCC
the first transaction to commit wins
no waits, no deadlocks, only aborts
yields don't abort transactions
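In Tarantool Lua this looks roughly as follows, assuming s is a vinyl space holding {id, counter} tuples; on a conflict box.commit() raises an error and the caller decides whether to retry.

    -- optimistic transaction control: no locks are taken while the transaction runs;
    -- the first transaction to commit wins, a conflicting one is aborted at commit
    box.begin()
    local t = s:get{1}                      -- reads see a consistent view plus our own writes
    s:replace{1, (t and t[2] or 0) + 1}     -- write based on the value just read
    box.commit()                            -- raises an error if a conflicting tx committed first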
24. Limitations
Tuple size is up to 16 MB
You need ~5k file descriptors per 1 TB of data
Optimistic transaction control is bad for long read-write transactions
No cross-engine transactions yet
25. Configuration
Name                  Description                                    Default
bloom_fpr             Bloom filter max false positive rate           5%
page_size             Approximate page size                          8192
range_size            Approximate range size                         1G
run_count_per_level   Sorted runs allowed on a single level          2
run_size_ratio        Run size ratio between levels                  3.5
memory                L0 total size                                  0.25G
cache                 Tuple cache size                               0.25G
threads               The number of compaction threads               2
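In recent Tarantool releases these knobs are set through box.cfg with a vinyl_ prefix; the names below follow current documentation and may differ slightly between versions.

    -- rough box.cfg equivalent of the table above (check your version's docs)
    box.cfg{
        vinyl_memory              = 256 * 1024 * 1024,   -- L0 total size
        vinyl_cache               = 256 * 1024 * 1024,   -- tuple cache size
        vinyl_page_size           = 8192,                -- approximate page size
        vinyl_range_size          = 1024 * 1024 * 1024,  -- approximate range size
        vinyl_run_count_per_level = 2,
        vinyl_run_size_ratio      = 3.5,
        vinyl_bloom_fpr           = 0.05,                -- max false positive rate
        vinyl_write_threads       = 2,                   -- dump/compaction threads
    }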
26. UPSERT
Non-reading update or insert
Delayed execution
Background upsert squashing prevents upserts from piling up
space:upsert(tuple, {{operator, field, value}, ...})
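A typical usage example, assuming a vinyl space s with {id, counter} tuples: bump the counter for key 1, or insert {1, 1} if the key is absent; no read is performed either way.

    -- '+' adds 1 to field 2 if the tuple exists; otherwise {1, 1} is inserted as given
    s:upsert({1, 1}, {{'+', 2, 1}})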