Atomicity and durability are two of the basic guarantees a database has to provide (see also ACID). Atomicity means a transaction is all-or-nothing: either all of its updates take effect, or none of them do. Durability means that once a transaction commits, its changes must survive a crash.

These requirements sound like they could be enforced by direct writes to disk: committed data should be on disk, and uncommitted data should not be. But databases do not write every change directly to disk as it happens. They route reads and writes through an in-memory cache of data pages called the buffer pool.

When a query needs a row, the database reads the containing page from disk into the buffer pool. When a transaction modifies that row, it changes the in-memory page and marks it dirty. At that point, the page in memory is newer than the page on disk. The buffer pool manager then has to decide when that dirty page should be written back.

The safe (and slow) choice

The simplest way to preserve atomicity and durability is to make the buffer pool follow two strict rules: no-steal and force.

No-steal means the buffer pool is not allowed to write pages dirtied by uncommitted transactions to disk. As long as a transaction is still in progress, its dirty pages must stay in memory. This makes atomicity easy. If the transaction aborts, the database can discard the in-memory changes, knowing that none of them reached the data files.

Force means the opposite kind of strictness at commit time. Before a transaction is allowed to commit, every page it dirtied must be written to disk. This makes durability easy. Once the client is told the transaction committed, the data files already contain the transaction’s changes.

Together, no-steal and force give a clean mental model: uncommitted changes never reach disk, and committed changes are always on disk before commit returns. In that world, crash recovery has little to do.

The problem is that this policy performs badly. No-steal ties the buffer pool’s memory usage to the size and duration of active transactions. A transaction that dirties many pages must keep all of those pages resident until it finishes. If a long-running transaction modifies more pages than the buffer pool can hold, the database cannot safely evict them, because writing them out would put uncommitted data on disk. Force creates a different bottleneck. A transaction’s updates may be spread across many unrelated data pages. Requiring all of those pages to be written at commit time turns commit into a collection of scattered data-file writes. Commit latency becomes tied to the number and physical location of the pages the transaction touched.

So the strict policy is easy to reason about, but too limiting for a practical system.

The practical choice: steal and no-force

To get better throughput, database systems usually want the opposite freedoms: they want to be able to evict dirty pages before the transaction that dirtied them commits, and they want to commit transactions without forcing all of their dirty data pages to disk immediately. Those two choices are called steal and no-force.

With steal, the buffer pool may write an uncommitted dirty page to disk when it needs the memory. This prevents active transactions from pinning too much of the buffer pool, but it creates an atomicity problem. If the transaction later aborts, the data files may already contain changes that should not persist.

With no-force, commit does not wait for the transaction’s dirty data pages to be written to the data files. Those pages can remain in memory and be flushed later. This avoids making commit depend on scattered data-page writes, but it creates a durability problem. If the server crashes after commit but before those pages are flushed, the data files may be missing changes from a transaction that already committed.

Restoring atomicity: undo logging

Steal creates the need for undo: if an uncommitted change reaches disk, the database must be able to remove it. For every change a transaction makes to a page, the system writes a log record describing exactly what was on the page before the modification. This is enough information to reverse the change later.

If the transaction aborts, the database applies its undo records in reverse order. Each step restores part of the database to the state it had before the transaction made its change. The same idea applies after a crash. Any transaction that was still in progress at the time of the crash is treated as aborted. Some of its changes may already have reached the data files because the system uses steal. Recovery uses the undo log to remove those changes.

Restoring durability: redo logging

No-force creates the need for redo: if a committed change has not reached disk, the database must be able to repeat it. For every modification a transaction makes, the system writes a log record describing the new change. On a clean shutdown, these records are ignored because the buffer pool eventually writes all dirty pages to disk anyway.

But if the database crashes, the redo log is what catches the data files up. On restart, recovery walks the redo log forward and reapplies every record from a committed transaction whose changes hadn’t yet reached the data files. The data files catch up to exactly where they were in memory right before the crash.

This is the key performance tradeoff. Rather than forcing many scattered data pages to disk at commit time, the database writes sequentially to the log. A transaction may touch pages spread across the data file, but its log records are appended to one place. Commit can therefore be reduced to flushing the log far enough to make the transaction durable.

What’s in a log record

A log record has to describe a page change in enough detail to redo or undo it. There are three common ways to do that, and they trade off compactness against how much state recovery has to assume.

Logical logging describes the change in terms of the operation: insert this row, delete this record, add this key to this index. These records are compact and close to the transaction’s intent. A single logical insert might stand for many physical changes: finding free space, moving records on a page, updating indexes, splitting a B-tree, or allocating new pages.

The difficulty is that logical undo and redo assume a consistent state. They assume the operation either happened, so it can be undone, or did not happen, so it can be redone. Crashes and failures often leave something messier. For example, an insert might crash after changing some pages but not others. Replaying “insert this row” is awkward when part of the insert may already have happened.

Physical logging describes the change in terms of stored data: this byte range, record, or page image changed from X to Y. Recovery does not need to understand the operation that caused the change. Undo restores the old value, and redo restores the new value. That makes physical records simple, reliable, and naturally idempotent. The downside is volume. A small logical operation can scatter changes across records, pages, indexes, and allocation structures, all of which may have to be logged.

Physiological logging sits between the two. The log record identifies the physical place being changed, usually a page, but describes the change logically within that page: insert a tuple into this slot, remove this key from this index page, update this record on page 42.

The write-ahead logging rule

Undo and redo only work if the log reaches disk at the right time. The ordering discipline that enforces this is called write-ahead logging, or WAL.

The core rule is:

  • For any change to a data page, the log record describing that change must be durable before the dirty data page is written to disk.

There is also a commit rule:

  • Before a transaction is reported as committed, the log records needed to redo that transaction must be durable.

The phrase “write-ahead” refers to this ordering. The log is written ahead of the data pages.

This rule is what lets the data files lag behind the logical state of the database. Dirty pages can be written early, before commit, because undo information exists. Dirty pages can also be written late, after commit, because redo information exists. The data files do not have to be perfectly up to date at every moment, as long as the log contains enough information to repair them after a crash.

Buffering the log

A naive implementation of the write-ahead log would write and fsync every log record as it’s generated. Sequential writes are cheap, but fsync isn’t: each one pays a fixed latency cost.

To solve this, databases stage log records in an in-memory log buffer. Backends append to the buffer as they do work; the buffer is flushed only when durability is required. The most common trigger is commit: a transaction must ensure the log has been flushed far enough to include its commit record before it can report success to the client. Between those forcing events, log records can accumulate and be written out in batches, either by the committing transaction itself or by a background writer draining the buffer.

Checkpoints

Logs can’t grow forever. Eventually, we need to discard old records so we don’t run out of disk space, and so crash recovery doesn’t have to replay history all the way back to the day the database was installed.

A checkpoint solves this. It’s a marker in the log that guarantees two things:

  • All log records written before the checkpoint belong to finished transactions.
  • All buffer-pool pages dirtied by those transactions have been flushed to the data files on disk.

Once a checkpoint completes, any log records older than the marker are no longer needed. The data is safely on disk, and there are no active transactions that might need undoing. The log can be safely truncated, and if a crash happens, recovery only needs to replay from the most recent checkpoint.

Putting it together

The recovery machinery follows from the buffer-pool policy.

A database wants steal because the buffer pool needs to evict dirty pages without waiting for transactions to finish. But steal means uncommitted changes can reach disk, so the database needs undo to preserve atomicity.

A database wants no-force because commit should not wait for every dirtied data page to be written back. But no-force means committed changes may be missing from the data files after a crash, so the database needs redo to preserve durability.

Write-ahead logging supplies the ordering discipline that makes both safe: the log reaches stable storage before the data pages that depend on it, and commit is acknowledged only after the committed transaction’s log records are durable.