Vinyl: why we wrote our own write-optimized storage engine rather than choose RocksDB
Konstantin Osipov
kostja@tarantool.org
Plan
• A review of the log-structured merge tree data structure
• The Vinyl engine in Tarantool
• Configuration parameters
• Key use cases
Why build, rather than buy: RocksDB, ForestDB, ...
[Architecture diagram: Tarantool's pluggable storage engine design. Shared layers: I/O (session, parser), transaction control (read your own writes, multi-engine, undo log), and the write-ahead log (redo log, async relay, group replication). Each engine (Engine 1, Engine 2, ...) implements its own indexing, checkpointing, and storage management.]
The shape of a classical LSM
DELETE in an LSM tree
• We can't delete from append-only files
• Tombstones (delete markers) are inserted into L0 instead (see the merge sketch after the diagrams below)
[Diagram: DELETE 26. L0 in memory holds 22 and 37; the disk runs hold 10 25 36 42 and 3 8 15 26 35 40 45 48, so key 26 lives only on disk.]
DELETE in an LSM tree
[Diagram: a tombstone for 26 is inserted into L0, which now holds 22, 26, 37; the disk runs are unchanged.]
DELETE in an LSM tree
[Diagram: after dump and compaction the disk holds a single run 3 8 10 15 22 25 35 36 37 40 42 45 48; key 26 and its tombstone are both gone, and L0 is empty.]
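To make the tombstone rule concrete, here is a minimal, engine-agnostic Lua sketch of a merge step that honours delete markers. The statement shape {key, lsn, op, value} and the merge_runs helper are illustrative assumptions, not Vinyl's actual internals, and the sketch presumes a major merge, so tombstones may be dropped for good.

-- Minimal sketch: merging two sorted runs while honouring tombstones.
-- Each statement is {key = ..., lsn = ..., op = "REPLACE"|"DELETE", value = ...}.
-- Assumes the newer input shadows the older one, and that this is a major
-- merge, so tombstones are not rewritten.
local function merge_runs(newer, older)
    local out = {}
    local seen = {}                       -- keys already decided by a newer statement
    local function emit(stmt)
        if seen[stmt.key] then
            return                        -- an older duplicate: shadowed, prune it
        end
        seen[stmt.key] = true
        if stmt.op ~= "DELETE" then
            table.insert(out, stmt)       -- tombstones are dropped on a major merge
        end
    end
    for _, stmt in ipairs(newer) do emit(stmt) end
    for _, stmt in ipairs(older) do emit(stmt) end
    table.sort(out, function(a, b) return a.key < b.key end)
    return out
end

-- DELETE 26 in memory shadows REPLACE 26 on disk; neither survives the merge.
local l0   = { {key = 26, op = "DELETE",  lsn = 105} }
local disk = { {key = 15, op = "REPLACE", lsn = 90, value = "a"},
               {key = 26, op = "REPLACE", lsn = 80, value = "b"} }
for _, s in ipairs(merge_runs(l0, disk)) do print(s.key, s.op) end   -- only 15 remains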
SELECT in LSM tree
• Search the levels, newest to oldest, until the key is found
• Optimistic for point queries
• A merge sort in case of a range select (see the lookup sketch below)
SELECT in LSM tree
[Diagram: GET(16) probes L0 (22, 37) and each disk run in turn until the key is found in the deepest run (2 6 7 10 14 16 23 28 30 32 37 38 41 45 47 49).]
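A point lookup can be sketched the same way: probe the levels from newest to oldest and stop at the first statement for the key, treating a tombstone as "not found". The levels table and get() helper below are illustrative, not Vinyl's API.

-- Minimal sketch of GET(key): probe levels newest-to-oldest, stop at the first hit.
-- Each level maps key -> {op = "REPLACE"|"DELETE", value = ...}.
local function get(levels, key)
    for _, level in ipairs(levels) do     -- levels[1] is the memtable (newest)
        local stmt = level[key]
        if stmt ~= nil then
            if stmt.op == "DELETE" then
                return nil                -- tombstone shadows older versions
            end
            return stmt.value
        end
    end
    return nil                            -- key never existed
end

local memtable = { [22] = {op = "REPLACE", value = "v22"} }
local run1     = { [16] = {op = "REPLACE", value = "v16"},
                   [25] = {op = "DELETE"} }
local run2     = { [25] = {op = "REPLACE", value = "old"} }
print(get({memtable, run1, run2}, 16))    -- "v16": found in the first disk run
print(get({memtable, run1, run2}, 25))    -- nil: the tombstone wins over the older run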
LSM: the algorithmic limit
          LSM tree              B-tree
Search    K * O(log2(N)/B)      O(logB(N))
Delete    O(log2(N)/B)          O(logB(N))
Insert    O(log2(N)/B)          O(logB(N))
RUM conjecture
The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems
Key LSM challenges in Web/OLTP
• Slow reads: read amplification
• The same data is potentially written multiple times: write amplification
• Garbage is kept around: space amplification
• Response times are affected by background activity: latency spikes
Vinyl: memtable, sorted run, dump & compact
• Stores statements, not values: REPLACE, DELETE, UPSERT
• Every statement is marked with an LSN
• Append-only files, garbage-collected after a checkpoint
• A transactional log of all filesystem changes: vylog
• Statement layout: key | lsn | op_code | value (see the sketch below)
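A hedged sketch of that statement layout in Lua, with an illustrative newest() helper showing that, among several statements for one key, the one with the highest LSN is the visible version:

-- Sketch of the statement shape: key, lsn, op_code, value.
-- The helpers are illustrative, not Vinyl's internal representation.
local function statement(key, lsn, op, value)
    return {key = key, lsn = lsn, op = op, value = value}
end

local function newest(stmts)
    local best
    for _, s in ipairs(stmts) do
        if best == nil or s.lsn > best.lsn then best = s end
    end
    return best
end

local history = {
    statement(42, 100, "REPLACE", "first"),
    statement(42, 120, "UPSERT",  {{"=", 2, "second"}}),
    statement(42, 130, "DELETE"),
}
print(newest(history).op)   -- "DELETE": the key is currently invisible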
Memtable, sorted run, dump & compact
[Diagram: the memtable (22, 37) in memory is dumped to disk as a new sorted run; the existing runs 10 25 36 40 and 3 8 15 26 36 40 45 are compacted into a single run 3 8 10 15 25 26 36 40 45.]
Vinyl: read [25, 30)
[Diagram: a range read of [25, 30) merge-sorts the sources from newest to oldest: uncommitted changes, the tuple cache, L0, and the sorted runs; the merged result is 25 26 29.]
Keeping the latency within limits
• anticipatory dump (see the rotation sketch below)
• throttling
[Diagram: the filled memtable (22, 37) becomes a read-only shadow copy that is dumped into a sorted run on disk, while a fresh memtable keeps accepting writes.]
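One way to read the picture above: when the active memtable nears its quota it is rotated into a read-only shadow copy that a background fiber dumps to disk, while new writes go into a fresh memtable. The Lua sketch below shows that rotation rule; the quota, watermark, and names are illustrative assumptions, not Vinyl's actual thresholds.

-- Sketch of anticipatory dump: rotate the memtable before memory runs out,
-- so writes never have to stop and wait for a full dump.
local QUOTA     = 256 * 1024 * 1024   -- illustrative L0 budget, bytes
local WATERMARK = 0.8                 -- start dumping at 80% of the quota

local state = {
    active = {size = 0, data = {}},   -- accepts new writes
    shadow = nil,                     -- read-only, being dumped to disk
}

local function maybe_rotate()
    if state.shadow == nil and state.active.size >= QUOTA * WATERMARK then
        state.shadow = state.active              -- freeze the current memtable
        state.active = {size = 0, data = {}}     -- new writes go here
        -- a background fiber would now dump state.shadow into a sorted run
    end
end

local function write(key, value, bytes)
    state.active.data[key] = value
    state.active.size = state.active.size + bytes
    maybe_rotate()
end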
Reducing read amplification
• page index
• bloom filters (see the sketch below)
• tuple range cache
• multi-level compaction
[Diagram: writes land in L0 (22, 37); each sorted run on disk (3 8 10 15 25 26 36 40 45) carries a page index (3 15 22 36) and a per-run bloom filter; hot tuples (22 25 26) are served from the tuple cache without touching disk.]
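The bloom filter is what lets a read skip a sorted run entirely: a lookup may see a false positive but never a false negative, so a negative answer proves the key was never written to that run. Below is a toy Lua filter using two seeded variants of one simple hash, purely to illustrate the check; it is not Vinyl's implementation.

-- Toy bloom filter: each insert sets two bits; a lookup that finds either
-- bit clear proves the key is absent from the run, so the run is skipped.
local NBITS = 1024

local function hash(s, seed)
    local h = seed
    for i = 1, #s do
        h = (h * 31 + s:byte(i)) % NBITS
    end
    return h + 1            -- 1-based index into the bit table
end

local function new_filter()
    return {bits = {}}
end

local function add(f, key)
    local s = tostring(key)
    f.bits[hash(s, 7)]   = true
    f.bits[hash(s, 131)] = true
end

local function may_contain(f, key)
    local s = tostring(key)
    return f.bits[hash(s, 7)] == true and f.bits[hash(s, 131)] == true
end

local f = new_filter()
add(f, 22); add(f, 37)
print(may_contain(f, 22))   -- true
print(may_contain(f, 99))   -- almost certainly false: the run is skipped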
Reducing write amplification
• Multi-level compaction can span any number of levels
• A level can contain multiple runs
[Diagram: writes flow down through several levels of sorted runs of increasing size; a level may hold more than one run, and a single compaction may merge runs from several levels at once.]
Reducing space amplification
• Ranges reflect a static layout of sorted runs
• Slices tie parts of a sorted run to individual ranges
[Diagram: the key space is split into ranges [-inf, 23), [23, 45), [45, +inf) held in memory; each range references slices of the on-disk sorted runs that overlap it.]
Secondary key support
• L0 and the tuple cache use memtx technology
• This makes it possible to include L0 and the tuple cache in a checkpoint, so there is no cold restart
• L1+ stores the primary-key value as the tuple id
• Unique secondary keys rely heavily on bloom filters for INSERT and REPLACE
• REPLACE into a non-unique secondary key is a blind write; garbage is pruned at compaction (see the sketch below)
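In application terms the slide corresponds to defining a vinyl space with a unique primary index and a non-unique secondary index. A minimal sketch, assuming the Tarantool 1.7+ Lua schema API and illustrative space and field names:

-- Minimal sketch: a vinyl space with a unique primary key and a
-- non-unique secondary key. Space and field names are illustrative.
box.cfg{}   -- assumes a freshly bootstrapped instance with default options

local s = box.schema.space.create('accounts', {engine = 'vinyl', if_not_exists = true})

-- Primary index: L1+ runs store the full tuple here.
s:create_index('pk', {parts = {1, 'unsigned'}, if_not_exists = true})

-- Secondary index: stores the primary-key value as a tuple id;
-- REPLACE into it is a blind write, stale entries are pruned at compaction.
s:create_index('by_owner', {
    parts = {2, 'string'},
    unique = false,
    if_not_exists = true,
})

s:replace{1, 'alice', 100}
s:replace{2, 'bob',   250}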
The scheduler
• The only active entity inside the vinyl engine
• Maintains two queues: dump and compact
• Each range is in both queues, enabling parallel compaction
• Dump always trumps compact (see the sketch below)
• Chunked garbage collection for quick memory reclamation
• Runs as a fiber, so it needs no locks
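A hedged sketch of the "dump always trumps compact" rule: the scheduler holds a dump queue and a compact queue and hands out compaction work only when nothing is waiting to be dumped. The queue names and next_task() helper are illustrative, not Vinyl's code.

-- Sketch of the scheduler's task selection: dump tasks always win,
-- compaction runs only when no dump is pending.
local dump_queue    = {}
local compact_queue = {}

local function next_task()
    if #dump_queue > 0 then
        return table.remove(dump_queue, 1)     -- frees L0 memory fastest
    end
    if #compact_queue > 0 then
        return table.remove(compact_queue, 1)
    end
    return nil                                 -- nothing to do, sleep
end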
Transactions
• MVCC
• The first transaction to commit wins
• No waits, no deadlocks, only aborts (see the retry sketch below)
• Yields don't abort transactions
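Because a conflicting transaction is aborted at commit rather than made to wait, callers typically wrap it in a retry loop. A minimal sketch, assuming the standard box.begin()/box.commit() Lua API and an illustrative space s holding {id, balance} tuples:

-- Sketch of retrying an optimistic transaction: the first committer wins,
-- a conflicting transaction gets an error and is simply retried.
local function transfer(s, from, to, amount)
    for _ = 1, 5 do
        local ok = pcall(function()
            box.begin()
            s:update(from, {{'-', 2, amount}})
            s:update(to,   {{'+', 2, amount}})
            box.commit()
        end)
        if ok then
            return true
        end
        pcall(box.rollback)    -- clean up the aborted transaction, then retry
    end
    return false, 'too many conflicts'
end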
Replication & backup
Work out of the box
Limitations
• Tuple size is limited to 16 MB
• You need ~5k file descriptors per 1 TB of data
• Optimistic transaction control is a poor fit for long read-write transactions
• No cross-engine transactions yet
Configuration
Name                 Description                                     Default
bloom_fpr            Bloom filter max false-positive rate            5%
page_size            Approximate page size                           8192
range_size           Approximate range size                          1G
run_count_per_level  How many sorted runs are allowed on one level   2
run_size_ratio       Run size ratio between levels                   3.5
memory               L0 total size                                   0.25G
cache                Tuple cache size                                0.25G
threads              The number of compaction threads                2
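For reference, these knobs map onto Tarantool's Lua configuration. A hedged sketch, assuming the vinyl_-prefixed box.cfg options of Tarantool 1.7+; in some versions page_size, range_size, run_count_per_level, run_size_ratio and bloom_fpr are passed per index to create_index instead, so check your version's documentation.

-- Illustrative vinyl tuning; option names assume the vinyl_-prefixed
-- box.cfg parameters.
box.cfg{
    vinyl_memory              = 256 * 1024 * 1024,   -- L0 total size
    vinyl_cache               = 256 * 1024 * 1024,   -- tuple cache size
    vinyl_page_size           = 8192,                -- approximate page size
    vinyl_range_size          = 1024 * 1024 * 1024,  -- approximate range size
    vinyl_run_count_per_level = 2,
    vinyl_run_size_ratio      = 3.5,
    vinyl_bloom_fpr           = 0.05,                -- max false-positive rate
}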
UPSERT
• A non-reading update or insert
• Delayed execution
• Background upsert squashing prevents upserts from piling up (see the example below)
space:upsert(tuple, {{operator, field, value}, ...})
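A concrete example of the call above, with an illustrative space s holding {id, counter} tuples: the tuple is inserted if the key is absent, otherwise the update operations are applied, and no read happens on the write path.

-- 's' is an illustrative vinyl space with tuples {id, counter}.
-- If key 42 is absent, insert {42, 1}; otherwise add 1 to field 2.
s:upsert({42, 1}, {{'+', 2, 1}})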
@kostja_osipov fb.com/TarantoolDatabase
www.tarantool.org  kostja@tarantool.org
Thank you! Questions?
Links
Leveled compaction in Apache Cassandra
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
http://www.scylladb.com/kb/compaction/
https://github.com/facebook/rocksdb/wiki/Universal-Compaction
https://dom.as/2015/04/09/how-innodb-lost-its-advantage/