Internal Storage Design of Modern Key-value Database Engines [Part 4]
Overview of the physical storage design implemented by many popular modern key-value stores such as Amazon DynamoDB, Apache Cassandra, Riak, ScyllaDB, LevelDB, and RocksDB.
In Part 1, Part 2 and Part 3 of this series, I covered the internal storage structure of modern key-value storage engines. In this post, the last part of the series, I present some LSM-Tree based variations.
LSM-Tree Variations
The original LSM-Tree storage model has been implemented by many popular key-value based storage engines. However, recent research has attempted to address the model's limitations, such as high read and write amplification, by exploring possible software and hardware optimizations. One of the main areas of research has focused on adapting the model to take advantage of SSD devices, to reduce I/O overhead and increase overall performance.
Let’s look at some of the notable research done in this area.
LSM-Tree with Unordered Segments
This is an implementation of an LSM-Tree that keeps only the keys sorted in the tree structure, while the actual records are appended to disk-resident segment files in an unordered fashion. This storage model was presented in the WiscKey key-value store design in 2017 [1]. The main goal of the project was to improve the performance of LSM-Trees on SSD devices and exploit their I/O characteristics. It aims to preserve the benefits of LSM-Tree storage techniques, such as great insert and lookup performance, without the excessive I/O overhead.
Design Goals
The main goal is to minimize I/O overhead and reduce the write and space amplification (in terms of the RUM Conjecture) that is characteristic of ordered, SSTable-based LSM-Trees, a trade-off that is justified when using magnetic HDDs but much less so on SSDs.
In this model, keys are separated from values: only the keys are maintained in the sorted LSM-Tree structure, while the values (the full records) are kept in separate segment log files.
As a result, sorting is decoupled from garbage collection and the Segment Compaction process by keeping records in unsorted append-only file segments. This significantly reduces the write amplification associated with sorted LSM-Tree implementations. The result is a more optimized and efficient garbage collection and compaction with reduced I/O overhead.
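A minimal sketch of the idea, using a toy in-memory index in place of the key LSM-Tree and an append-only value log; the class and method names (`KVSeparatedStore`, `put`, `get`) are illustrative and not WiscKey's actual API:

```python
import os

class KVSeparatedStore:
    """Toy sketch of WiscKey-style key/value separation (illustrative only)."""

    def __init__(self, log_path):
        self.index = {}                    # key -> (offset, length), stands in for the key LSM-Tree
        self.log = open(log_path, "a+b")   # unordered, append-only value log

    def put(self, key: bytes, value: bytes) -> None:
        self.log.seek(0, os.SEEK_END)      # values are simply appended, never sorted
        offset = self.log.tell()
        self.log.write(value)
        self.log.flush()
        self.index[key] = (offset, len(value))   # only the small pointer goes into the tree

    def get(self, key: bytes) -> bytes:
        offset, length = self.index[key]   # step 1: find the pointer via the key index
        return os.pread(self.log.fileno(), length, offset)   # step 2: one random read (POSIX pread)
```

A write touches the tree with only a few bytes of pointer, and a read needs at most one extra random read into the log, which is the behaviour the following paragraphs rely on.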
This is a useful model for use cases with a low rate of updates and deletes, where compaction would not free up much space anyway [2].
Furthermore, the size of the LSM-Tree is reduced because it maintains only keys and pointers, which leads to fewer disk reads and more efficient caching.
A lookup will likely search fewer levels of LSM-Tree files, and because of the reduced size, a significant portion of the tree can be cached in memory.
Based on the benchmarks reported in the WiscKey paper, this implementation is faster than the LSM-Tree-based models of both LevelDB and RocksDB for random lookups (1.6× to 14× faster) on SSDs [1].
Lookups & Range Queries
To retrieve key-value data, WiscKey searches the LSM-Tree consisting of keys and references to the values' locations. After finding the key, another read is issued to retrieve the value from the segment file. Since a good portion of the LSM-Tree can be cached in memory due to its reduced size, a lookup can be served with a single random seek to retrieve the value from the log.
To support range queries similar to SSTables, which are sorted by key, WiscKey takes advantage of the random I/O parallelism available in SSDs to amortize the cost of the multiple seeks needed. As long as the keys can be retrieved efficiently in sorted order, the corresponding values can be prefetched in parallel by a background pool of 32 threads [1].
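A sketch of how such a range scan could be parallelized, reusing the toy `KVSeparatedStore` above; the 32-thread pool mirrors the figure quoted from the paper, everything else is an assumption for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def range_query(store, start_key, end_key, workers=32):
    # Keys are cheap to enumerate in order from the (small, mostly cached) key index.
    keys = sorted(k for k in store.index if start_key <= k <= end_key)
    # Values are scattered across the unordered log, so fetch them with a pool of
    # worker threads and let the SSD's internal parallelism absorb the random reads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(store.get, keys))
    return list(zip(keys, values))
```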
Compaction & Garbage Collection
In this design, compaction only needs to sort keys in the LSM-Tree. Since keys are smaller than full records, this improves compaction efficiency and reduces I/O amplification and the amount of data being processed. Garbage collection is done by a background thread in a circular fashion: the thread reads from the tail of the log, checks the validity of the records against the LSM-Tree, and writes the still-valid records back to the head of the file. Unordered segment compaction is also more efficient, since segments can be merged sequentially into larger files without an iterative merge-sort. Finally, pointers are updated to point to the new locations.
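A simplified pass of such a collector over the toy store above might look like the following; the tail record format (each entry carrying its key, old offset, and value) is an assumption, since the toy log sketched earlier stores bare values only:

```python
def garbage_collect(store, tail_records):
    """Illustrative value-log GC pass over records decoded from the log's tail.

    `tail_records` is an iterable of (key, old_offset, value) tuples. A record
    is live only if the key index still points at exactly that offset; live
    values are re-appended at the head of the log (which also updates their
    pointers), and the reclaimed tail region can then be freed.
    """
    for key, old_offset, value in tail_records:
        pointer = store.index.get(key)
        if pointer is not None and pointer[0] == old_offset:
            store.put(key, value)   # still valid: move to the head, pointer is updated
        # otherwise the record was overwritten or deleted and is simply dropped
```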
SSD Optimisation
The original design goal of WiscKey was to improve the performance of LSM-Tree on SSD devices by reducing the read and write amplification associated with common LSM-Tree implementations. This I/O optimization comes at the cost of requiring random I/O lookups for range queries. SSDs provide a high degree of internal parallelism, which can be exploited in storage design, and the aggregate random read throughput achieved can match that of sequential read [1].
Therefore, to counter the negative effect on random I/O cost associated with range scans, WiscKey attempts to match the I/O patterns with the random and sequential performance characteristics of SSDs, using internal SSD parallelism to prefetch blocks in parallel and fully utilize device bandwidth for range lookups.
Limitation & Challenges
Although the proposed technology can deliver good performance for most real key-value use cases when implemented on SSDs, its performance degrades for use cases where many small values are written in random order, or where access patterns involve large sequential range queries. In these cases, it may perform worse than existing LSM-Tree implementations, such as LevelDB [1]. Additionally, the separation of keys and values introduces new challenges for compaction, crash recovery, and consistency.
LSM-Tree on Open-Channel SSD
A research study conducted by Wang, Peng, et al. [3] focused on implementing the LSM-Tree storage model on Open-Channel SSDs, named LOCS for short. The goal was to optimize the I/O throughput of LSM-Tree read and write operations by taking advantage of the multi-channel I/O architecture of NAND Flash-Based SSDs. The study extended LevelDB to explicitly leverage the internal SSD's multi-channel capabilities, achieving higher I/O parallelism, particularly in optimizing random read latency.
A standard SSD exposes only a single block-device interface for I/O operations. Therefore, the device had to be modified to expose its internal channels and parallelism to the application through a customized controller, bypassing the default flash translation layer (FTL).
The new interface presents each channel as an independent device to the applications (e.g., from /dev/ssd0 to /dev/ssd43). This allows for direct access to individual flash channels. To schedule and dispatch I/O requests from the engine to available SSD channels, a scheduler was added between the LevelDB engine and the Storage API.
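A rough sketch of such a dispatcher; the per-channel queues and the "least-loaded channel" policy are one plausible choice for illustration, not necessarily the exact scheduling policy used by LOCS:

```python
from collections import deque

class ChannelScheduler:
    """Toy dispatcher that spreads I/O requests across exposed SSD channels."""

    def __init__(self, channels):
        self.queues = {ch: deque() for ch in channels}   # one request queue per channel device

    def dispatch(self, request):
        # Send the request to the channel with the fewest pending requests so that
        # all channels stay busy and the device's internal parallelism is exploited.
        channel = min(self.queues, key=lambda ch: len(self.queues[ch]))
        self.queues[channel].append(request)
        return channel

# Channels exposed as independent devices, as in the paper's /dev/ssd0 .. /dev/ssd43 example.
scheduler = ChannelScheduler([f"/dev/ssd{i}" for i in range(44)])
```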
According to the research, the I/O throughput for read and write requests, as well as the number of operations per second (OPs), improved up to three times compared to LevelDB. This improvement allows multiple compaction jobs to run in parallel, improving compaction performance and throughput. On average, throughput increased from 221 MB/s to about 361 MB/s [3].
The presented design was adopted in the data centers of Baidu, a Chinese internet search company. However, Dong, Siying, et al. [4] argued that these types of SSD optimizations for LSM-Tree based engines would only benefit a small number of applications, as most applications are constrained by space on SSDs rather than by latency or erase cycles.
Limitations, Opportunities and Conclusion
The LSM-Tree storage model provides excellent write throughput and latency for both random and sequential writes. It also covers range scans well. However, random reads are the slowest operation, as benchmarked by Google [5].
Read operations are also slowed down due to the reconciliation process required when multiple versions of the data are present. LSM-Tree fundamentally doesn't support low-latency lookups, so other techniques need to be implemented to reduce the cost of random reads for read-heavy and latency-sensitive use-cases.
Overall, LSM-Tree is associated with I/O read and write amplification, based on the RUM Conjecture cost model presented earlier in this series. This is due to consulting multiple files for lookups and periodically rewriting the same data multiple times during the Segment Compaction process throughout its lifetime. However, this is a justified trade-off, especially on HDDs, for achieving high random write performance and throughput in applicable use cases. Write amplification increases as the dataset grows larger, since records will likely go through more compaction cycles as they traverse up the levels.
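As a back-of-envelope illustration of that last point, a rough model of leveled compaction (the memtable size and fan-out below are assumptions, not measurements) shows write amplification growing with the number of levels, and therefore with dataset size:

```python
import math

def estimated_write_amplification(data_gb, memtable_gb=0.064, fanout=10):
    # Rough leveled-compaction model: a record is rewritten roughly `fanout`
    # times for each level it climbs, so WA ~ levels * fanout.
    levels = max(1, math.ceil(math.log(data_gb / memtable_gb, fanout)))
    return levels * fanout

print(estimated_write_amplification(10))     # ~10 GB dataset -> fewer levels, lower WA
print(estimated_write_amplification(10000))  # ~10 TB dataset -> more levels, higher WA
```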
That being said, LSM-Tree is still a very popular storage model among distributed, non-distributed, and even embedded key-value stores, especially for applications with high insert rates. Facebook's Inbox Search is a great example of such an application, and it is what Apache Cassandra was originally developed for [6]. According to the published paper, the application had to handle a very high write throughput of billions of writes per day, supporting millions of users and growing.
Today, LSM-Tree-based key-value storage engines are deployed to production, backing small to very large data-intensive applications.
References
[1] Lu, Lanyue, et al. "WiscKey: Separating keys from values in SSD-conscious storage." ACM Transactions on Storage (TOS) 13.1 (2017): 1-28.
[2] Petrov, Alex. Database Internals: A deep dive into how distributed data systems work. O'Reilly Media, 2019.
[3] Wang, Peng, et al. "An efficient design and implementation of LSM-tree based key-value store on open-channel SSD." Proceedings of the Ninth European Conference on Computer Systems. 2014.
[4] Dong, Siying, et al. "RocksDB: Evolution of development priorities in a key-value store serving large-scale applications." ACM Transactions on Storage (TOS) 17.4 (2021): 1-32.
[5] Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 1-26.
[6] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS operating systems review 44.2 (2010): 35-40.