☕ Welcome to The Coder Cafe! Some people reached out after I published the Build Your Own Key-Value Storage Engine series to say they hadn’t gone through all eight posts, but they were curious about the core ideas. So I distilled everything into a single post. No implementation, no exercises, just the core concepts behind LSM trees. Get cozy, grab a coffee, and let’s begin!
Fundamental Insights
To understand LSM trees, we first need to understand why writes are hard.
A B-tree-based database updates data in place. When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection.
LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially. Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward.
The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around.
The in-memory buffer is called the memtable. The sorted files on disk are called SSTables. We’ll look at each in detail.
The Memtable
Every write in an LSM tree starts in memory, in a structure called the memtable.
The memtable is a mutable, in-memory store. When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput.
A hashtable works for lookups but not for in-order iteration. Sorting a hashtable takes O(n log n) at flush time. A better choice is an ordered data structure. The most common in practice is a skip list; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too.
Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order.
The memtable doesn’t grow forever. At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL.
The Write-Ahead Log
There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability.
The solution is a Write-Ahead Log, or WAL.
Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk. Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client. This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes.
The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet.
One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. To guarantee durability, the engine must call fsync() after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though. fsync() adds latency to every write. Production systems often use fdatasync() instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call fsync() once for the group.
The WAL introduces write amplification: the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability.
SSTables
As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable.
An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter:
Immutability makes SSTables safe to read concurrently without locking.
Sorted order makes lookups inside a file efficient.
In a simple implementation, an SSTable is just a JSON array of key-value pairs, sorted by key:
[
{ “key”: “apple”, “value”: “1” },
{ “key”: “banana”, “value”: “2” },
{ “key”: “cherry”, “value”: “3” }
]Production systems use a binary block-based format instead. The SSTable is divided into fixed-size blocks, typically 4 KB, though the exact size varies by implementation. Data blocks hold the actual key-value entries. The SSTable also contains an index block storing the first key of each data block, which makes it possible to binary search for the right block without reading the entire file. In most implementations, the index block is written at the end of the file, since block boundaries are only known after all data blocks have been streamed out. To look up a key, we read the index block, binary search it to find the right data block, fetch that single block from disk, verify its integrity with a checksum, and then binary search within the block. When the index block is not cached, this means most lookups read two disk pages: the index block and one data block. In practice, index blocks are typically kept in memory, so most lookups require only one disk read.
Each data block also carries a checksum computed over the block’s bytes. Before using the data, the engine verifies the checksum. If they don’t match, the block is corrupted, and the read fails safely rather than returning garbage.
As SSTables accumulate, the engine maintains a catalog file (often called a MANIFEST in systems like RocksDB), which is an append-only log listing all existing SSTables in order of creation. This catalog is the engine’s source of truth for what files exist on disk. On startup, the engine reads it to know which files are live, and replays the WAL to restore the memtable. After a successful flush, the old WAL can be discarded. The data is now safely in an SSTable.
Production systems also compress data blocks, typically with a fast algorithm like Snappy, LZ4, or zstd. Compression reduces disk footprint and I/O at the cost of CPU, and it interacts with block sizing: a compressed block may be smaller than a disk page, so implementations often track both logical and physical block sizes.
Reads and Read Amplification
LSM trees are optimized for writes. Reads are where the trade-off shows.
To look up a key, the engine searches in order of recency: first the memtable, then SSTables from newest to oldest. The first match wins. This ordering matters because the same key can appear multiple times across different SSTables. Each write to a key produces a new entry rather than updating the existing one. The newest version is the correct one.
The problem becomes clear as SSTables accumulate. A key that was written once and never updated might still require the engine to search through dozens of SSTables before finding it, or confirming it doesn’t exist. Each SSTable search is a disk read. This is called read amplification: a single logical read triggers multiple physical reads.
For a key that doesn’t exist at all, the engine must check every SSTable before returning a not-found error. That’s the worst case for read amplification, and it gets worse the more SSTables there are.
This is a fundamental tension in LSM trees, and it reflects a deeper principle known as the RUM conjecture: a storage engine can excel at two of reads, updates, and memory efficiency, but not all three at once. LSM trees make a deliberate choice: optimize for updates, accept read amplification as the cost.
The sorted structure also enables efficient range scans. To retrieve all keys between start and end, the engine scans the memtable in order, then merges sorted streams from the relevant SSTables.
Compaction
The answer to accumulating SSTables is compaction.
Compaction is a background process that takes multiple SSTables, merges them into fewer, cleaner ones, and discards the originals. The result is fewer files to search through, which directly reduces read amplification. It also reclaims disk space consumed by redundant entries: if the same key appears in three different SSTables, compaction keeps only the newest version and discards the rest.
One common algorithm is a k-way merge. The engine opens iterators over all SSTables being compacted, each positioned at the first entry. It uses a min-heap to always pull the smallest key across all iterators. When the same key appears in multiple SSTables, the engine picks the version from the newest SSTable and discards the older ones. The merged output is streamed into new SSTable files. In practice, real systems limit the number of SSTables that can participate in a single compaction run to keep resource consumption under control.
Updating the catalog after compaction requires care. The engine must not delete the old SSTables before the new ones are safely written to disk. The safe sequence is: write new SSTables, fsync, write a new catalog pointing to the new files, fsync, then delete the old SSTables. A crash at any point leaves the engine in a recoverable state: either the old files are still referenced by the old catalog, or the new files are referenced by the new catalog.
Compaction is not free. It consumes I/O and CPU in the background, competing with foreground reads and writes. Every byte of data gets rewritten multiple times across its lifetime, adding to write amplification. Tuning when compaction triggers (and how aggressively it runs) is one of the main knobs in LSM tree performance.
Deletions and Tombstones
We might expect deletion to be straightforward: find the key, remove it. In an LSM tree, it is anything but straightforward.
SSTables are immutable. We cannot reach into an existing SSTable and remove an entry. So when a key is deleted, the engine writes a special marker to the memtable called a tombstone, an entry that says “this key is deleted”. It eventually gets flushed to an SSTable like any other write.
During reads, the engine respects tombstones. If a tombstone for a key is found before a value for that key, scanning newest to oldest, the key is treated as deleted, and a not-found error is returned. The tombstone shadows any older value.
The tricky part is knowing when it is safe to discard a tombstone during compaction. Consider this situation: a tombstone for key foo exists in a newer SSTable, and an old value for foo exists in an older SSTable that hasn’t been compacted yet. If we drop the tombstone during compaction without also removing the old value, the old value becomes visible again. Deleted data reappears. This is called data resurrection, and it is a correctness bug.
NOTE: Correctness here means the engine returns what was actually written, not a stale or deleted value. This is different from consistency in the distributed systems sense, which describes the guarantees clients have about which version of data they see across replicas.
The rule is strict: a tombstone can only be dropped when the engine can guarantee that no older value for that key exists anywhere below it on disk. In practice, this means the compaction must include the oldest SSTables that could still hold a shadowed value.
This is one of those details that seems minor until we get it wrong. A storage engine that resurrects deleted data is not a storage engine we can trust. Getting this right requires knowing exactly where older values can hide, which brings us to how SSTables are organized on disk.
Leveling
Basic compaction, merging all SSTables into one flat pool, works but doesn’t scale. As the dataset grows, a flat pool of SSTables means reads still have to check many files. Leveling is the structural answer.
In a leveled LSM tree, SSTables are organized into levels: L0, L1, L2, and so on. Each level has different rules:
L0is the landing zone. When the memtable flushes, the resulting SSTable lands in L0.L0files can have overlapping key ranges: two L0 files might both contain entries for keyfoo. This is acceptable because L0 files are small and short-lived.L1and deeper levels are different. Each level maintains non-overlapping key ranges across all its files. A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key inL1, we don’t scan allL1files. We use the key ranges to jump directly to the one file that could contain it.
When L0 accumulates enough files, a compaction runs to merge L0 into L1. This merge enforces the non-overlapping invariant: L0 files (which may overlap) get merged with the relevant L1 files (which define the ranges), producing new L1 files with clean, non-overlapping ranges. Similarly, when L1 grows too large, a compaction merges part of L1 into L2.
Each deeper level is typically larger by a fixed ratio, for example, 10x. L1 might hold 10 MB, L2 100 MB, L3 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels.
The benefit is controlled read amplification. To look up a key, we check the memtable, scan all L0 files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist.
When compaction falls behind and L0 accumulates too many files, the engine may trigger a write stall: new writes are paused until compaction catches up and L0 is drained. This is one of the more painful operational issues in LSM-based systems.
Leveled compaction is also not the only strategy. Tiered compaction, used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload.
Bloom Filters
Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist.
For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up.
Bloom filters solve this. A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low.
Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there.
How does it work? A Bloom filter is a bitset. When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable.
The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read. Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically.
Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk.
Concurrency
Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background. This is where things get subtle.
The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash.
One common solution is a versioned catalog.
A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count.
Background workers (the flush worker and the compaction worker) never modify an existing catalog. Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely.
An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove.
This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap, which in many implementations is a single atomic pointer update.
The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted.
Summary
LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads.
The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing.
The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery.
SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe.
A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state.
Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all
L0files.Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O.
Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk.
Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level. Tiered compaction is an alternative strategy that trades less write amplification for more read amplification.
Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario.
A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state.
Resources
More From the Distributed Systems Category
Sources
The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper.
Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary.






