Introduction
Last week, you built the first version of an LSM-based storage engine: an in-memory memtable for recent writes, immutable SSTables on disk, and a MANIFEST file listing the SSTable files. However, if the database crashed, any data still in the memtable would be lost.
This week, you will focus on durability by introducing Write-Ahead Logging (WAL). A WAL is an append-only file on disk that records the same operations you keep in memory. How it works:
On write, record it in the WAL and the memtable.
On restart, you read the WAL from start to end and apply each record to the memtable.
Introducing a WAL is not free, though. Writes are slower because each write also goes to the WAL. It also increases write amplification: the ratio of bytes actually written to storage to bytes logically requested by the client.
Another important aspect of durability is when to synchronize a file’s state with the storage device. When you write to a file, the data may appear to be saved, but the bytes may still sit in memory caches rather than on the physical disk. These caches are managed by the OS’s filesystem, an abstraction over the disk. If the machine crashes before the data is flushed, you can lose it.
To force the data to stable storage, you need to call a sync primitive. The simple, portable choice is to call fsync, a system call that flushes a file’s buffered data and required metadata to disk.
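To make this concrete, here is what the write-then-sync sequence looks like in Go (the series is language-agnostic, so take this as one possible mapping; Go’s os.File.Sync wraps fsync(2)):

```go
package main

import "os"

func main() {
	// Open the WAL in append mode; each Write lands at the end of the file.
	f, err := os.OpenFile("wal.db", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte(`{"op":"put","key":"k","value":"v1"}` + "\n")); err != nil {
		panic(err)
	}
	// Sync wraps fsync(2): it blocks until the file's buffered data
	// (and metadata) reach stable storage, not just the page cache.
	if err := f.Sync(); err != nil {
		panic(err)
	}
}
```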
Your Tasks
💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server (#kv-store-engine channel).
PUT
For the WAL data format, you won’t use a single JSON document as with the SSTables, but NDJSON (Newline-Delimited JSON): one JSON object per line, which makes the format naturally append-only.
Append a record to the WAL file wal.db, opened with O_APPEND. Set the op field to "put", and the key and value fields to the provided key and value. For example, after writing k/v1 and then k/v2:

{"op":"put","key":"k","value":"v1"}
{"op":"put","key":"k","value":"v2"} <- Latest write

fsync wal.db.
Update the memtable with the same logic as before:
If the key exists, update the value.
Otherwise, create a new entry.
Acknowledge the HTTP request.
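Here is one possible shape for this PUT path in Go; Store, walRecord, and Put are illustrative names, not part of the assignment:

```go
package main

import (
	"encoding/json"
	"os"
	"sync"
)

// walRecord mirrors the NDJSON schema above.
type walRecord struct {
	Op    string `json:"op"`
	Key   string `json:"key"`
	Value string `json:"value"`
}

// Store is a minimal container for the WAL handle and the memtable.
type Store struct {
	mu       sync.Mutex
	wal      *os.File          // wal.db, opened with O_APPEND
	memtable map[string]string // recent writes
}

// Put runs the steps above: append to the WAL, fsync, update the
// memtable, then return so the caller can acknowledge the request.
func (s *Store) Put(key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	line, err := json.Marshal(walRecord{Op: "put", Key: key, Value: value})
	if err != nil {
		return err
	}
	if _, err := s.wal.Write(append(line, '\n')); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil { // fsync before acknowledging
		return err
	}
	s.memtable[key] = value // insert or overwrite
	return nil
}
```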
Startup
Create an empty wal.db file if it doesn’t exist.
Replay the WAL from start to end. For each valid line, apply it to the memtable.
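A minimal replay sketch in Go, reusing the walRecord type from the PUT sketch above:

```go
package main

import (
	"bufio"
	"encoding/json"
	"os"
)

// replayWAL rebuilds the memtable from wal.db, oldest record first.
func replayWAL(path string, memtable map[string]string) error {
	// O_CREATE covers the "create an empty wal.db if it doesn't exist" step.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var rec walRecord
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			continue // not a valid record; see the optional checksum task
		}
		if rec.Op == "put" {
			memtable[rec.Key] = rec.Value // later lines overwrite earlier ones
		}
	}
	return scanner.Err()
}
```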
Flush
Keep the same flush trigger (2,000 entries) and the same logic (stop-the-world operation) as last week:
Write the new SSTable:
Flush the memtable as a new immutable JSON SSTable file with keys sorted (same as before).
fsync the SSTable file.
fsync the parent directory of the SSTable to make the new filename persistent.
Update the MANIFEST atomically:
Read the current MANIFEST lines into memory and append the new SSTable filename.
Open MANIFEST.tmp with O_CREAT | O_TRUNC | O_WRONLY.
Write the entire list to MANIFEST.tmp from the start.
fsync MANIFEST.tmp.
Rename MANIFEST.tmp → MANIFEST.
fsync the parent directory of the MANIFEST.
Reset the WAL:
Truncate the WAL to zero length.
fsync the WAL file.
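A sketch of steps 2 and 3 (atomic MANIFEST update, then WAL reset) in Go; flushManifestAndResetWAL is an illustrative helper, and the MANIFEST is assumed to hold one SSTable filename per line:

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
)

// flushManifestAndResetWAL makes the new SSTable visible, then clears the WAL.
// sstName is the filename of the SSTable that was just written and fsynced.
func flushManifestAndResetWAL(dir, sstName string, wal *os.File) error {
	// Read the current MANIFEST (it may not exist before the first flush)
	// and append the new filename.
	manifest := filepath.Join(dir, "MANIFEST")
	current, err := os.ReadFile(manifest)
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	updated := append(current, []byte(sstName+"\n")...)

	// Write the full list to MANIFEST.tmp, fsync it, then rename it over
	// MANIFEST. On POSIX systems, rename(2) is atomic.
	tmpPath := filepath.Join(dir, "MANIFEST.tmp")
	tmp, err := os.OpenFile(tmpPath, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	if _, err := tmp.Write(updated); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmpPath, manifest); err != nil {
		return err
	}

	// fsync the parent directory so the rename itself is durable.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	if err := d.Sync(); err != nil {
		d.Close()
		return err
	}
	d.Close()

	// Reset the WAL: truncate to zero length, then fsync.
	if err := wal.Truncate(0); err != nil {
		return err
	}
	return wal.Sync()
}
```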
Client & Validation
If the server is unavailable, do not fail. Retry indefinitely with a short delay or exponential backoff (see the sketch below).
To assess durability:
Run the client against the same input file (put.txt).
Stop and restart your database randomly during the run.
Your client should confirm that no acknowledged writes were lost after recovery.
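For the unavailable-server case, a small retry loop with capped exponential backoff might look like this in Go (the URL and JSON body shape are assumptions; adapt them to your client):

```go
package main

import (
	"bytes"
	"net/http"
	"time"
)

// putWithRetry retries a PUT until the server acknowledges it,
// backing off while the server is down or restarting.
func putWithRetry(url string, body []byte) {
	delay := 50 * time.Millisecond
	for {
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err == nil {
			ok := resp.StatusCode == http.StatusOK
			resp.Body.Close()
			if ok {
				return // acknowledged: this write must survive any later crash
			}
		}
		time.Sleep(delay)
		if delay < 2*time.Second {
			delay *= 2 // exponential backoff, capped at ~2s
		}
	}
}
```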
[Optional] Partial WAL Records
Add a per-record checksum to each WAL record. On startup, verify records and stop at the first invalid/truncated one, discarding the tail.
For reference, ScyllaDB checksums segments using CRC32; see its commitlog segment file format for inspiration.
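One possible framing in Go: prefix each NDJSON line with a hex-encoded CRC32 of the payload. The tab-separated layout is an assumption; any framing that lets you verify a line works:

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
	"os"
	"strconv"
)

// appendChecksummed writes "<crc32>\t<json>\n" to the WAL.
func appendChecksummed(wal *os.File, line []byte) error {
	sum := crc32.ChecksumIEEE(line)
	_, err := fmt.Fprintf(wal, "%08x\t%s\n", sum, line)
	return err
}

// verifyLine returns the JSON payload, or false if the record is corrupt
// (for example, a torn final line after a crash mid-write).
func verifyLine(raw []byte) ([]byte, bool) {
	parts := bytes.SplitN(raw, []byte("\t"), 2)
	if len(parts) != 2 {
		return nil, false
	}
	want, err := strconv.ParseUint(string(parts[0]), 16, 32)
	if err != nil || crc32.ChecksumIEEE(parts[1]) != uint32(want) {
		return nil, false
	}
	return parts[1], true
}
```

On startup, run each line through verifyLine, stop at the first failure, and truncate the file back to the last valid record.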
[Optional] Dangling Files
During the flush process, if the database crashes after step 1 (write the new SSTable) and before step 2 (update the MANIFEST atomically), you may end up with a dangling SSTable file on disk.
Add a startup routine to delete any file that exists on disk but is not listed in the MANIFEST. This keeps the data directory aligned with the MANIFEST after a crash.
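A possible cleanup routine in Go; the sst- filename prefix is a hypothetical naming scheme, so adapt it to yours:

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// removeDanglingSSTables deletes SSTable files not listed in the MANIFEST,
// i.e. leftovers from a crash between steps 1 and 2 of the flush.
func removeDanglingSSTables(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "MANIFEST"))
	if err != nil {
		return err
	}
	listed := make(map[string]bool)
	for _, name := range strings.Fields(string(data)) {
		listed[name] = true
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		name := e.Name()
		// Assumption: SSTables are the only files named sst-*.
		if strings.HasPrefix(name, "sst-") && !listed[name] {
			if err := os.Remove(filepath.Join(dir, name)); err != nil {
				return err
			}
		}
	}
	return nil
}
```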
Wrap Up
That’s it for this week! Your storage engine is now durable. On restart, data that was in the memtable is recovered from the WAL. This is made possible by fsync and the atomic update of the MANIFEST.
Deletion is not handled yet. Reads can also be slow: in the worst case, a lookup for a missing key must check every SSTable, which quickly becomes highly inefficient.
In two weeks, you will add a DELETE endpoint and learn how SSTables are compacted so the engine can reclaim space and keep reads efficient.
Further Notes
In your implementation, you used fsync as a simple "make it durable now" button. In practice, you have finer control over both what you sync and when you sync it.
What: fdatasync (or opening the file with O_DSYNC) persists the data without pushing unrelated metadata, which is usually what you want for WAL appends. You can go further with O_DIRECT + O_DSYNC to bypass the page cache and sync only the data you wrote, but that comes with extra complexity.
When: Syncing after every request is what systems that promise durability offer, but it is often not the default. Many databases use group commit, which batches several writes into one fsync call to amortize the cost while still providing strong guarantees. For additional information, see A write-ahead log is not a universal part of durability.
For example, RocksDB provides options for tuning WAL behavior to meet the needs of different applications:
Synchronous WAL writes (what you implemented this week).
Group commit.
No WAL writes at all.
If you want, you can also explore group commit in your implementation and measure its impact on durability, latency, and throughput, since this series will not cover it later.
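For a flavor of what group commit can look like, here is a sketch in Go: writers queue records on a channel, and a single committer goroutine batches them under one fsync. The commitReq and committer names are illustrative:

```go
package main

import (
	"bytes"
	"os"
)

// commitReq is one pending write plus a channel to report the outcome.
type commitReq struct {
	line []byte
	done chan error
}

// committer drains queued writes into batches and pays for one fsync
// per batch instead of one per write.
func committer(wal *os.File, reqs chan commitReq) {
	for first := range reqs {
		batch := []commitReq{first}
		// Opportunistically grab whatever else is already queued.
	drain:
		for {
			select {
			case r := <-reqs:
				batch = append(batch, r)
			default:
				break drain
			}
		}

		var buf bytes.Buffer
		for _, r := range batch {
			buf.Write(r.line)
			buf.WriteByte('\n')
		}
		_, err := wal.Write(buf.Bytes())
		if err == nil {
			err = wal.Sync() // a single fsync covers the whole batch
		}
		for _, r := range batch {
			r.done <- err // wake every writer in the batch
		}
	}
}
```

A writer then sends commitReq{line: rec, done: make(chan error, 1)} and acknowledges its HTTP request only once it receives a nil error back.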
Finally, since a WAL adds I/O to the write path, storage engines use a few practical tricks to keep it fast and predictable. A common one is to preallocate fixed-size WAL segments at startup to:
Avoid the penalty of dynamic allocation.
Prevent write fragmentation.
Align buffers for O_DIRECT (an open(2) flag for direct I/O that bypasses the OS page cache).
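On Linux, preallocation can be done with fallocate(2). A sketch using golang.org/x/sys/unix (Linux-only; the function name is illustrative):

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// preallocateSegment reserves size bytes for a WAL segment up front.
// fallocate(2) allocates real blocks, whereas os.File.Truncate would
// typically create a sparse file instead.
func preallocateSegment(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		return err
	}
	return f.Sync()
}
```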