Internal Storage Design of Modern Key-value Database Engines [Part 3]
An overview of the physical storage design implemented by many popular modern key-value stores such as Amazon DynamoDB, Apache Cassandra, Riak, ScyllaDB, LevelDB and RocksDB.
In Part 1 and Part 2, I covered the internal storage structure of modern key-value storage engines and its origins. In this part, I introduce a major background operation without which the data lifecycle would be incomplete: compaction. Next, a cost model for classifying storage systems is presented, along with its application to the LSM-Tree storage model. Finally, some performance optimization techniques are briefly discussed.
Segment Compaction
Compaction is the process of combining segment data files to reclaim space. It is usually carried out by the storage engine as a housekeeping and data optimization step.
If you recall the LSM-Tree model presented earlier, all operations such as updates and deletes are treated as inserts due to the immutable nature of the underlying data files. Compaction is therefore a common housekeeping and optimization operation performed by LSM-Tree based storage engines, and by any immutable storage structure, to merge and clean segments.
To summarize, compaction is usually performed in order to:
Remove redundant data, such as records marked as deleted.
Remove duplicate keys by eliminating outdated versions of records and keeping only the most recent version.
Reclaim unused, fragmented space.
Merge smaller segments into larger files.
It must also be noted that compaction is an intensive operation that usually puts pressure on the CPU and disk. It is therefore usually scheduled in the background at a suitable time to minimize its impact on the engine’s performance.
SSTable Compaction
Since SSTables are sorted data files, compaction is more straightforward: a suitable iterative merge-and-reconcile algorithm, such as the merge step of merge sort, can be used to combine multiple sorted SSTable files in LSM-Tree style structures.
During the merge and compaction process, duplicate keys are removed and only the latest record for each key is written to the new merged SSTable file.
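To make this concrete, here is a minimal, hypothetical Python sketch (not any engine’s actual code) that merges sorted segments represented as lists of (key, seq, value) tuples, where a higher seq means a more recent write:

```python
import heapq
from itertools import groupby

def merge_sstables(*sstables):
    """Merge sorted SSTable segments, keeping only the newest version
    of each key.

    Each segment is assumed to be a list of (key, seq, value) tuples
    sorted by key, where a higher seq means a more recent write.
    Illustrative sketch only, not any engine's implementation.
    """
    merged = []
    # heapq.merge is the lazy k-way merge step of merge sort.
    stream = heapq.merge(*sstables)
    for key, versions in groupby(stream, key=lambda entry: entry[0]):
        # Keep only the entry with the highest sequence number per key.
        merged.append(max(versions, key=lambda entry: entry[1]))
    return merged

# Usage: two segments with an overlapping key.
older = [("apple", 1, "red"), ("kiwi", 2, "green")]
newer = [("apple", 3, "green"), ("pear", 4, "yellow")]
print(merge_sstables(older, newer))
# [('apple', 3, 'green'), ('kiwi', 2, 'green'), ('pear', 4, 'yellow')]
```

Because the inputs are already sorted, the merge needs only a single sequential pass over each segment, which is what makes compacting SSTables much cheaper than merging unsorted files.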
Dealing with Deletes
Since in-place updates are not allowed in such immutable storage structures, deletes are usually handled with the tombstone technique: instead of physically removing the record at request time, a special marker is inserted as an explicit hint that the record associated with the key is to be deleted.
During read requests, the data is reconciled and records associated with tombstones are filtered out. The tombstones and the records they reference are physically removed during the periodic segment compaction process.
For range delete requests (in the form of WHERE key ≥ x AND key < y), predicate deletes are used, which cover the range of keys matching the predicate to be deleted. In Apache Cassandra these are called Range Tombstones [1].
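The point-tombstone behavior described above can be illustrated with a toy Python sketch; the class and method names are made up for illustration and do not correspond to any engine’s API:

```python
TOMBSTONE = object()  # sentinel marking "this key was deleted"

class SimpleLSMView:
    """Toy illustration of the tombstone technique (not a real engine).

    Deletes append a tombstone instead of removing data in place;
    reads reconcile segments from newest to oldest and hide deleted keys.
    """

    def __init__(self):
        self.segments = []  # list of dicts, oldest first, immutable once written

    def flush_segment(self, records):
        self.segments.append(dict(records))

    def delete(self, key):
        # A delete is just another immutable write: a tombstone marker.
        self.flush_segment({key: TOMBSTONE})

    def get(self, key):
        # The newest segment wins; a tombstone hides all older versions.
        for segment in reversed(self.segments):
            if key in segment:
                value = segment[key]
                return None if value is TOMBSTONE else value
        return None

db = SimpleLSMView()
db.flush_segment({"user:1": "alice", "user:2": "bob"})
db.delete("user:1")
print(db.get("user:1"))  # None - hidden by the tombstone
print(db.get("user:2"))  # 'bob'
```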
Compaction Algorithms
There are a few common compaction mechanisms used by LSM-Tree based key-value stores, which are briefly discussed in this section.
Leveled SSTable and Leveled Compaction
Leveled SSTable is an SSTable layout and organization technique in which segment files on disk are assigned to levels of different sizes, each level having a sequence number.
The layout is as follows:
The lowest level, level-0, contains segment files generated by flushing the contents of the in-memory buffer to data files on disk.
Once the number of files on level-0 reaches a threshold, they are merged and split into key ranges by the compaction process and moved up to level-1.
As soon as the number of segments on level-1 or a higher level reaches a threshold, the compaction process picks segments with overlapping key ranges from the current level and the next level (where key-range partitioning is used) and produces new, larger segment files on the higher level.
Keys do not overlap between segments within each level except for level-0.
There is a limit on the size of the segment files on each level, with the limit increasing at each higher level.
RocksDB and LevelDB implement this leveled layout. LevelDB maintains seven SSTable levels (L0 to L6), with the target size of each level increasing by a factor of 10 [2].
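As a rough illustration of this layout, the sketch below computes hypothetical per-level target sizes (the 10 MB base and the factor of 10 are illustrative, loosely modeled on LevelDB-style defaults) and shows how segments with overlapping key ranges on the next level could be selected for compaction:

```python
def level_target_sizes(base_bytes=10 * 1024**2, fanout=10, levels=7):
    """Target size per level when each level is `fanout` times larger
    than the previous one. The numbers are illustrative, loosely based
    on LevelDB-style defaults."""
    return [base_bytes * fanout**i for i in range(levels)]

def overlapping(segment, next_level_segments):
    """Pick segments on the next level whose key ranges overlap the
    segment being compacted. Segments are (min_key, max_key) tuples."""
    lo, hi = segment
    return [s for s in next_level_segments if not (s[1] < lo or s[0] > hi)]

print(level_target_sizes()[:3])  # [10485760, 104857600, 1048576000]
print(overlapping(("f", "m"), [("a", "e"), ("d", "g"), ("k", "p"), ("q", "z")]))
# [('d', 'g'), ('k', 'p')]
```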
Leveled Compaction Implications
The implication of such a design is that the number of file searches for a lookup request is bounded by the number of levels, except for level-0, where multiple files might be checked due to key overlap between files at that level. Lookup latency is therefore increased (read amplification). There is also major write amplification associated with the levels, as files need to be rewritten when data moves up the levels, especially as the database grows and records are moved up the levels many times by the compaction process.
Due to the high write amplification of leveled compaction (see the RUM Conjecture in the next section), it is recommended for workloads where the rate of reads is higher than that of writes (Datastax recommends a read-to-write ratio of at least 2:1 [3]) and for workloads with many random reads; otherwise the performance penalty might not be worth the benefit.
Size-tiered Compaction
In size-tiered compaction, SSTable segments are grouped based on their size. It is a lazier compaction strategy than leveled compaction, and its primary aim is to reduce the write amplification associated with leveled compaction.
In this compaction strategy, SSTables of similar size are put into buckets, and a major compaction is triggered when the number of files in a bucket exceeds a threshold or the size ratio exceeds a specific threshold. Unlike leveled compaction, SSTables in the same bucket do not preserve unique key ranges, so this strategy results in higher read amplification.
This strategy is more suitable for write-heavy use cases where the ratio of inserts to reads is high; read-heavy use cases would suffer from the read performance penalty. Apache Cassandra and RocksDB support size-tiered compaction.
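A simplified sketch of the bucketing idea follows; the threshold values and the function itself are illustrative and do not reproduce Cassandra’s or RocksDB’s actual heuristics:

```python
def bucket_by_size(sstables, ratio=2.0, min_threshold=4):
    """Group SSTables of similar size into buckets and return the
    buckets that are large enough to compact. A simplified sketch of
    the size-tiered idea, not Cassandra's or RocksDB's exact logic.

    `sstables` is a list of (name, size_bytes) pairs; a table joins a
    bucket if its size is within `ratio` of the bucket's average size.
    """
    buckets = []  # each bucket is a mutable pair: [average_size, members]
    for name, size in sorted(sstables, key=lambda t: t[1]):
        for bucket in buckets:
            average, members = bucket
            if size < average * ratio:
                members.append((name, size))
                bucket[0] = sum(s for _, s in members) / len(members)
                break
        else:
            buckets.append([size, [(name, size)]])
    # Buckets with enough similar-sized tables are candidates for compaction.
    return [members for _, members in buckets if len(members) >= min_threshold]

tables = [("s1", 10), ("s2", 11), ("s3", 12), ("s4", 9), ("s5", 400)]
print(bucket_by_size(tables))
# [[('s4', 9), ('s1', 10), ('s2', 11), ('s3', 12)]]
```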
LSM-Tree and Associated RUM Conjecture
The RUM Conjecture is a tradeoff and cost model presented by Athanassoulis et al. [4] that takes the three factors of Read, Update and Memory overheads into account. It states that optimizing for and reducing two of these overheads will amplify and worsen the third. This ultimately means that any optimization comes at the expense of one of these factors. The tradeoffs between the three factors are balanced differently in each system, based on its design goals.
In the context of our immutable LSM-Tree storage model, the RUM Conjecture provides a cost model with the following three factors:
Read Amplification — The overhead of reads resulting from having to check multiple SSTable segment files to look up records.
Write Amplification — The overhead of writes associated with the continuous rewriting of segment files during the compaction process.
Space Amplification — The overhead of disk space associated with storing multiple versions of the same key.
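In practice these overheads are often expressed as simple ratios; the helper below is an illustrative definition (individual engines compute their own statistics) of how such numbers can be interpreted:

```python
def amplification(user_bytes_written, disk_bytes_written,
                  user_bytes_read, disk_bytes_read,
                  live_data_bytes, disk_space_used):
    """Express the three overheads as ratios. This is an illustrative
    definition; individual engines measure these in their own ways."""
    return {
        # bytes written to disk per logical byte written by the user
        "write_amplification": disk_bytes_written / user_bytes_written,
        # bytes read from disk per byte returned to the user
        "read_amplification": disk_bytes_read / user_bytes_read,
        # disk space used per byte of live (current) data
        "space_amplification": disk_space_used / live_data_bytes,
    }

print(amplification(user_bytes_written=1_000, disk_bytes_written=11_000,
                    user_bytes_read=500, disk_bytes_read=4_000,
                    live_data_bytes=1_000, disk_space_used=1_110))
# {'write_amplification': 11.0, 'read_amplification': 8.0, 'space_amplification': 1.11}
```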
With respect to the RUM Conjecture, the LSM-Tree based storage model is characterized by better space amplification than B-Tree storage models, while having relatively poor read and write amplification:
Space Amplification
Due to the log-structured nature of its physical representation, the LSM-Tree model is more space efficient than the B-Tree model.
B-Tree structures suffer from higher space amplification due to fragmentation and fixed-size pages with unused space, often only half or two-thirds full, which causes the space amplification of B-Tree based structures to be worse than 1.5 [5].
In contrast, the LSM-Tree does not suffer from fragmentation, thanks to the immutable and append-only nature of its data files on disk.
The only space amplification associated with the LSM-Tree comes from stale, outdated data that has not yet been removed by the compaction process.
Facebook’s RocksDB reports a space amplification of only 1.11 when comparing the lower-level data files to the final data files in the last SSTable level [5].
RocksDB tries to reduce space amplification further by adapting the size of the levels to the size of the data and by applying suitable compression strategies at different levels.
Write Amplification
While LSM-Tree based storage engines are optimized for excellent random write performance, they have relatively high write amplification: with the leveled SSTable scheme, data files are usually rewritten multiple times over their lifecycle until they end up in the last level.
This could be reduced by using fewer levels; however, the fewer the levels, the higher the read latency and read amplification.
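As a rough, illustrative back-of-envelope (not an exact model, and heavily workload dependent), the worst-case write amplification of a leveled scheme can be estimated as one write to the log, one flush to level-0, and up to roughly the size ratio for each promotion to a higher level:

```python
def rough_leveled_write_amp(levels=7, fanout=10):
    """Very rough worst-case estimate of leveled write amplification:
    one write to the log, one flush to level-0, and up to ~`fanout`
    rewrites for each promotion to a higher level. Real numbers depend
    heavily on the workload and the engine's heuristics."""
    return 1 + 1 + fanout * (levels - 1)

print(rough_leveled_write_amp())          # 62 with LevelDB-like defaults
print(rough_leveled_write_amp(levels=4))  # 32 with fewer levels
```

The estimate only illustrates the scale of the effect; as noted above, reducing the number of levels trades this write amplification against read latency and amplification.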
Read Amplification
The LSM-Tree is characterized by slower random read performance due to its on-disk architecture of maintaining, and having to check, multiple data files in order to look up a key.
Therefore a pure LSM-Tree has high read amplification by default if no optimization techniques are implemented. To look up a key, first the Memtable is checked, followed by the recent data files in level-0, then level-1, and so on, until all files have been checked if the key does not exist.
Various techniques such as a block cache, block-level indexes, Bloom filters and an SSTable Manifest File are implemented by LSM-Tree based storage engines to minimize read amplification and improve random read latency.
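As an illustration of one of these techniques, here is a minimal Bloom filter sketch (a toy implementation, not the filter format used by any particular engine); keeping one such filter per SSTable lets the engine skip files that definitely do not contain the key:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch. A per-SSTable filter lets the engine
    skip segment files that definitely do not contain a key, trading a
    small false-positive rate for far fewer disk reads."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:999"))  # almost certainly False
```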
Overall, the LSM-Tree based storage model is associated with high read and write amplification due to its fundamental storage design.
Performance Optimization
To enhance the read and write performance of SSTables, various optimizations have been implemented by different engines, such as:
As mentioned, an index block is computed and stored in each SSTable (with one index entry per block) for binary search and faster key lookup.
A block cache is implemented by engines such as LevelDB and RocksDB to serve recently accessed blocks from memory and avoid disk access.
Bloom filters are used, and usually loaded in memory, to improve lookup latency and reduce read amplification (see the sketch in the previous section).
SSTables are combined with an in-memory hash map index for faster key lookup.
An SSTable Manifest File is used in engines such as LevelDB and RocksDB to optimize lookup performance.
Some of these optimizations are briefly discussed below.
SSTable Manifest File
A Manifest File for SSTables was introduced by LevelDB [5] to optimize random lookup performance. The Manifest File maintains a list of all SSTables at all levels along with their respective key ranges. It is maintained as a log to which updates to the SSTable set are appended, and it can be cached in memory to enable fast identification of the SSTables that may contain a lookup key.
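Below is a hypothetical sketch of how such a cached manifest can narrow a lookup down to a few candidate files; the structure and file names are made up and do not follow LevelDB’s actual manifest format:

```python
def candidate_sstables(manifest, key):
    """Return the SSTables that may contain `key`, given an in-memory
    manifest mapping level -> list of (file_name, min_key, max_key).
    A simplified sketch of the idea, not LevelDB's manifest format."""
    candidates = []
    for level, files in sorted(manifest.items()):
        for name, lo, hi in files:
            if lo <= key <= hi:
                candidates.append((level, name))
                if level > 0:
                    break  # key ranges do not overlap above level-0
    return candidates

manifest = {
    0: [("000012.sst", "a", "z"), ("000013.sst", "k", "t")],
    1: [("000007.sst", "a", "h"), ("000008.sst", "i", "q")],
}
print(candidate_sstables(manifest, "m"))
# [(0, '000012.sst'), (0, '000013.sst'), (1, '000008.sst')]
```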
SSTable-Attached Secondary Indexes (SASI)
SSTable-Attached Secondary Indexes (SASI) is a secondary indexing technique for SSTables that allows indexing any field (column) in addition to the primary key.
A secondary index structure is created per SSTable, with the same lifecycle as the SSTable it references. Once the Memtable buffer flushing process is completed, the secondary index is generated along with the SSTable. During the compaction process, a SASI index might be merged or discarded, following the lifecycle of the referenced SSTable. A separate SASI index is also maintained for Memtable-resident records [6].
The SASI index was originally developed for Apache Cassandra to enable indexing columns.
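The per-SSTable lifecycle can be illustrated with a toy sketch; the real SASI format is far more sophisticated, and the rows and helper below are purely hypothetical:

```python
from collections import defaultdict

def build_secondary_index(sstable_rows, column):
    """Build a per-SSTable secondary index: column value -> primary keys
    within this one segment. Purely illustrative; the real SASI on-disk
    format is far more sophisticated."""
    index = defaultdict(list)
    for primary_key, row in sstable_rows.items():
        index[row.get(column)].append(primary_key)
    return dict(index)

segment = {
    "user:1": {"name": "alice", "city": "Berlin"},
    "user:2": {"name": "bob", "city": "Berlin"},
    "user:3": {"name": "carol", "city": "Oslo"},
}
print(build_secondary_index(segment, "city"))
# {'Berlin': ['user:1', 'user:2'], 'Oslo': ['user:3']}
```

Because the index is generated when the segment is flushed and discarded when the segment is compacted away, it never needs to be updated in place, matching the immutable lifecycle of the SSTable itself.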
Using Hash-Map Index
By combining append-only SSTable segments with a hash map index, the model can be improved to provide faster reads: a key lookup goes through the in-memory hash map index and requires at most one disk seek (if the segment is already paged in by the OS, no disk seek is needed at all), without having to sequentially scan all files to find the key. Bitcask, the default storage engine of Riak, implements this technique [7].
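A toy sketch of this idea is shown below; the file layout and class are hypothetical and much simpler than Bitcask’s actual on-disk format, but they show the essential pieces: an append-only segment file plus an in-memory dictionary mapping each key to the offset of its latest value.

```python
class HashIndexedLog:
    """Append-only log segment plus an in-memory hash index of
    key -> (offset, length). A toy sketch of the Bitcask idea, not
    its actual on-disk format."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # make sure the segment file exists

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()        # append position = end of file
            f.write(record)
        self.index[key] = (offset, len(record))  # newest offset wins

    def get(self, key):
        if key not in self.index:
            return None
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)           # at most one seek per lookup
            _, value = f.read(length).decode().rstrip("\n").split("\t", 1)
        return value

log = HashIndexedLog("segment-0001.log")
log.put("user:1", "alice")
log.put("user:1", "alicia")          # an update is just another append
print(log.get("user:1"))             # 'alicia'
```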
However, there are some limitations:
The hash map index for the entire data set must fit in memory, which is not feasible for large datasets.
Only single-key or multi-key lookups are fast; other access patterns, such as range queries and analytical queries scanning large sets of records, are not supported.
Range queries are inefficient, since keys are not sorted and each key in the range has to be looked up individually in the hash index and the segments [8].
Concurrent writes to a segment are not supported, since only one writer can append to the file.
In the next and final part, some popular variations and optimized implementations of the storage model will be presented.
References
[1] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS operating systems review 44.2 (2010): 35-40.
[2] Lu, Lanyue, et al. "Wisckey: Separating keys from values in ssd-conscious storage." ACM Transactions on Storage (TOS) 13.1 (2017): 1-28.
[3] https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configChooseCompactStrategy.html
[4] Athanassoulis, Manos, et al. "Designing Access Methods: The RUM Conjecture." EDBT (2016).
[5] Dong, Siying, et al. "Optimizing Space Amplification in RocksDB." CIDR. Vol. 3. 2017.
[6] Petrov, Alex. Database Internals: A deep dive into how distributed data systems work. O'Reilly Media, 2019.
[7] Sheehy, Justin, and David Smith. "Bitcask: A log-structured hash table for fast key/value data." Basho White Paper (2010).
[8] Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. 1st ed. O'Reilly Media, 2017.