How redo and undo keep databases atomic and durable
Atomicity and durability are two of the basic guarantees a database has to provide (see also ACID). Atomicity means a transaction is all-or-nothing: either all of its updates take effect, or none of them do. Durability means that once a transaction commits, its changes must survive a crash.
These requirements sound like they could be enforced by direct writes to disk: committed data should be on disk, and uncommitted data should not be. But databases do not write every change directly to disk as it happens. They route reads and writes through an in-memory cache of data pages called the buffer pool.
When a query needs a row, the database reads the containing page from disk into the buffer pool. When a transaction modifies that row, it changes the in-memory page and marks it dirty. At that point, the page in memory is newer than the page on disk. The buffer pool manager then has to decide when that dirty page should be written back.
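The read-through, mark-dirty, write-back cycle can be sketched in a few lines. This is a toy model with invented names (a dict stands in for the data files); real buffer pools track much more state, but the dirty-flag logic is the same.

```python
DISK = {1: "row-v1"}          # page_id -> page contents "on disk"

class BufferPool:
    def __init__(self):
        self.pages = {}       # page_id -> [contents, dirty?]

    def read(self, page_id):
        if page_id not in self.pages:
            # cache miss: fetch the page from disk into memory
            self.pages[page_id] = [DISK[page_id], False]
        return self.pages[page_id][0]

    def write(self, page_id, contents):
        self.read(page_id)                      # ensure the page is resident
        self.pages[page_id] = [contents, True]  # newer than disk: dirty

    def flush(self, page_id):
        contents, dirty = self.pages[page_id]
        if dirty:
            DISK[page_id] = contents            # write back; clean again
            self.pages[page_id][1] = False

pool = BufferPool()
pool.write(1, "row-v2")
# the in-memory page is now newer than the on-disk page
assert pool.read(1) == "row-v2" and DISK[1] == "row-v1"
pool.flush(1)
assert DISK[1] == "row-v2"
```

Everything that follows is about when `flush` is allowed to run.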
The safe (and slow) choice
The simplest way to preserve atomicity and durability is to make the buffer pool follow two strict rules: no-steal and force.
No-steal means the buffer pool is not allowed to write pages dirtied by uncommitted transactions to disk. As long as a transaction is still in progress, its dirty pages must stay in memory. This makes atomicity easy. If the transaction aborts, the database can discard the in-memory changes, knowing that none of them reached the data files.
Force means the opposite kind of strictness at commit time. Before a transaction is allowed to commit, every page it dirtied must be written to disk. This makes durability easy. Once the client is told the transaction committed, the data files already contain the transaction’s changes.
Together, no-steal and force give a clean mental model: uncommitted changes never reach disk, and committed changes are always on disk before commit returns. In that world, crash recovery has little to do.
The problem is that this policy performs badly. No-steal ties the buffer pool’s memory usage to the size and duration of active transactions. A transaction that dirties many pages must keep all of those pages resident until it finishes. If a long-running transaction modifies more pages than the buffer pool can hold, the database cannot safely evict them, because writing them out would put uncommitted data on disk. Force creates a different bottleneck. A transaction’s updates may be spread across many unrelated data pages. Requiring all of those pages to be written at commit time turns commit into a collection of scattered data-file writes. Commit latency becomes tied to the number and physical location of the pages the transaction touched.
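The two strict rules amount to a pair of guards, sketched below with hypothetical names. No-steal shows up as an eviction check; force shows up as a flush loop inside commit.

```python
active_txns = set()   # transactions still in progress
dirty_by = {}         # page_id -> transaction that dirtied it
on_disk = set()       # pages whose latest version has been written back

def can_evict(page_id):
    # no-steal: never write out a page dirtied by a live transaction
    return dirty_by.get(page_id) not in active_txns

def commit(txn):
    # force: every page the transaction dirtied must reach disk
    # before commit returns to the client
    for page_id, owner in list(dirty_by.items()):
        if owner == txn:
            on_disk.add(page_id)
            del dirty_by[page_id]
    active_txns.discard(txn)

active_txns.add("t1")
dirty_by[7] = "t1"
assert not can_evict(7)      # t1 is live: page 7 is pinned in memory
commit("t1")
assert 7 in on_disk          # commit returned only after the flush
```

The performance problems fall directly out of these guards: `can_evict` pins memory for as long as a transaction runs, and the loop in `commit` is a batch of scattered data-page writes on the commit path.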
So the strict policy is easy to reason about, but too limiting for a practical system.
The practical choice: steal and no-force
To get better throughput, database systems usually want the opposite freedoms: they want to be able to evict dirty pages before the transaction that dirtied them commits, and they want to commit transactions without forcing all of their dirty data pages to disk immediately. Those two choices are called steal and no-force.
With steal, the buffer pool may write an uncommitted dirty page to disk when it needs the memory. This prevents active transactions from pinning too much of the buffer pool, but it creates an atomicity problem. If the transaction later aborts, the data files may already contain changes that should not persist.
With no-force, commit does not wait for the transaction’s dirty data pages to be written to the data files. Those pages can remain in memory and be flushed later. This avoids making commit depend on scattered data-page writes, but it creates a durability problem. If the server crashes after commit but before those pages are flushed, the data files may be missing changes from a transaction that already committed.
Restoring atomicity: undo logging
Steal creates the need for undo: if an uncommitted change reaches disk, the database must be able to remove it. For every change a transaction makes to a page, the system writes a log record describing exactly what was on the page before the modification. This is enough information to reverse the change later.
If the transaction aborts, the database walks its undo records backward and applies them in reverse order. Each undo step restores part of the database to the state it had before the transaction made its change. The same idea applies after a crash. Any transaction that was still in progress at the time of the crash is treated as aborted. Some of its changes may already have reached the data files because the system uses steal. Recovery uses the undo log to remove those changes.
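The essence of undo is a stack of before-images, replayed in reverse. The sketch below is heavily simplified (real systems log at finer granularity and tag records with log sequence numbers), but it shows why reverse order matters: the oldest before-image must be restored last, so it wins.

```python
pages = {"A": 10, "B": 20}   # in-memory page contents
undo_log = []                # (page_id, before_image), in append order

def update(page_id, new_value):
    undo_log.append((page_id, pages[page_id]))  # record the old value first
    pages[page_id] = new_value

def rollback():
    # walk the undo records backward, restoring each before-image
    while undo_log:
        page_id, before = undo_log.pop()
        pages[page_id] = before

update("A", 11)
update("B", 21)
update("A", 12)   # A has two undo records; the older one must win
rollback()
assert pages == {"A": 10, "B": 20}
```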
Restoring durability: redo logging
No-force creates the need for redo: if a committed change has not reached disk, the database must be able to repeat it. For every modification a transaction makes, the system writes a log record describing the page's new contents after the change. On a clean shutdown, these records are never needed, because the buffer pool eventually writes all dirty pages to disk anyway.
But if the database crashes, the redo log is what catches the data files up. On restart, recovery walks the redo log forward and reapplies every record from a committed transaction whose changes had not yet reached the data files. Afterward, the data files reflect every committed transaction, which is exactly the state the crash would otherwise have lost.
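A minimal sketch of that replay, under two simplifying assumptions: records carry full after-images, and the whole log fits in memory. Recovery first scans for commit records, then replays forward, skipping transactions that never committed.

```python
# a toy redo log; the "crash" happens after the last record
log = [
    ("update", "t1", "A", 11),
    ("update", "t2", "B", 21),
    ("commit", "t1", None, None),
    # crash here: t2 never committed
]

def recover(data_files, log):
    committed = {txn for op, txn, _, _ in log if op == "commit"}
    for op, txn, page, value in log:
        if op == "update" and txn in committed:
            data_files[page] = value   # reapply the committed change
    return data_files

# the data files missed both writes, but only t1's is reapplied
assert recover({"A": 10, "B": 20}, log) == {"A": 11, "B": 20}
```

t2's update to B is simply ignored; in a full system with steal, it might also need undoing if it had reached the data files before the crash.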
This is the key performance tradeoff. Rather than forcing many scattered data pages to disk at commit time, the database writes sequentially to the log. A transaction may touch pages spread across the data file, but its log records are appended to one place. Commit can therefore be reduced to flushing the log far enough to make the transaction durable.
The write-ahead logging rule
Undo and redo only work if the log reaches disk at the right time. The ordering discipline that enforces this is called write-ahead logging, or WAL.
The core rule is:
- For any change to a data page, the log record describing that change must be durable before the dirty data page is written to disk.
There is also a commit rule:
- Before a transaction is reported as committed, the log records needed to redo that transaction must be durable.
The phrase “write-ahead” refers to this ordering. The log is written ahead of the data pages.
This rule is what lets the data files lag behind the logical state of the database. Dirty pages can be written early, before commit, because undo information exists. Dirty pages can also be written late, after commit, because redo information exists. The data files do not have to be perfectly up to date at every moment, as long as the log contains enough information to repair them after a crash.
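One common way to enforce the core rule is to tag each page with the position (LSN) of the latest log record describing it, and guard every page write-back with a log flush. A sketch, with invented names and LSNs reduced to plain integers:

```python
flushed_lsn = 0      # the log is durable through this position

writes = []          # order in which writes reach "disk", for illustration

def flush_log(up_to):
    global flushed_lsn
    if up_to > flushed_lsn:
        writes.append(("log", up_to))
        flushed_lsn = up_to

def write_page(page_id, page_lsn):
    # WAL rule: the log record describing the page's latest change
    # must be durable before the page itself is written back
    if page_lsn > flushed_lsn:
        flush_log(page_lsn)
    writes.append(("page", page_id))

write_page("A", page_lsn=5)
assert writes == [("log", 5), ("page", "A")]   # the log went first
```

The commit rule is the same guard applied to the commit record: report success only after `flushed_lsn` covers it.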
Buffering the log
If every log record were written and flushed to disk individually, the log would become a bottleneck. Even though log writes are sequential, forcing storage to sync for every small record would limit throughput.
So databases use a log buffer: an in-memory area where log records are appended as transactions run. The log buffer is flushed to the log file in batches. At commit, a transaction only needs to ensure that the log has been flushed far enough to include its commit record and the redo records it depends on. It does not need to flush the entire buffer pool.
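The commit path then reduces to "flush the log up to my commit record." A sketch of that group-flush behavior, with a list standing in for the durable log file:

```python
log_buffer = []   # records appended but not yet on disk
log_file = []     # records durably on disk (stands in for the fsync'd log)

def append(record):
    log_buffer.append(record)
    return len(log_file) + len(log_buffer)   # this record's position (LSN)

def flush_up_to(lsn):
    # move buffered records to the durable log until lsn is covered;
    # one flush can make many transactions' records durable at once
    while len(log_file) < lsn:
        log_file.append(log_buffer.pop(0))

def commit(txn):
    lsn = append(("commit", txn))
    flush_up_to(lsn)   # durability point: the commit record is on disk

append(("update", "t1", "A", 11))
commit("t1")
assert ("commit", "t1") in log_file   # and the update record before it
```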
Checkpoints
Logs can’t grow forever. Eventually, we need to discard old records so we don’t run out of disk space, and so crash recovery doesn’t have to replay history all the way back to the day the database was installed.
A checkpoint solves this. It’s a marker in the log that guarantees two things:
- All log records written before the checkpoint belong to finished transactions.
- All buffer-pool pages dirtied by those transactions have been flushed to the data files on disk.
Once a checkpoint completes, any log records older than the marker are no longer needed. The data is safely on disk, and there are no active transactions that might need undoing. The log can be safely truncated, and if a crash happens, recovery only needs to replay from the most recent checkpoint.
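Truncation itself is simple once the checkpoint's guarantees hold: find the most recent checkpoint marker and drop everything before it. A sketch over a toy in-memory log:

```python
log = [
    ("update", "t1", "A", 11),
    ("commit", "t1"),
    ("checkpoint",),           # t1 finished and its pages were flushed
    ("update", "t2", "B", 21),
]

def truncate(log):
    # keep only records at or after the last checkpoint marker;
    # recovery never needs to look further back than this
    for i in range(len(log) - 1, -1, -1):
        if log[i][0] == "checkpoint":
            return log[i:]
    return log   # no checkpoint yet: keep everything

assert truncate(log)[0] == ("checkpoint",)
```

t1's records are gone, but that is safe: its changes are already in the data files, and no crash can require undoing a finished transaction.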
Taking a checkpoint forces the database to do the random I/O it was trying to avoid, writing every dirty page back to disk. To keep that from freezing the system, modern databases use fuzzy checkpoints, trickling dirty pages out in the background so the cost is spread across the checkpoint interval rather than landing all at once.
Putting it together
The recovery machinery follows from the buffer-pool policy.
A database wants steal because the buffer pool needs to evict dirty pages without waiting for transactions to finish. But steal means uncommitted changes can reach disk, so the database needs undo to preserve atomicity.
A database wants no-force because commit should not wait for every dirtied data page to be written back. But no-force means committed changes may be missing from the data files after a crash, so the database needs redo to preserve durability.
Write-ahead logging supplies the ordering discipline that makes both safe: the log reaches stable storage before the data pages that depend on it, and commit is acknowledged only after the committed transaction’s log records are durable.