Inside the Postgres write-ahead log
Postgres’s durability relies on a write-ahead log (WAL). Changes to a data page are logged and flushed to disk before the page reaches the data files. After a crash, the WAL is enough to replay the missing work and bring the database back to a consistent state.
The recovery log described in textbooks often carries both redo and undo logs. Postgres only needs redo in the WAL because MVCC handles rollback visibility. If a transaction aborts, its row versions can remain on disk; readers ignore them after checking transaction status. Crash recovery is therefore forward-only: replay WAL from the last checkpoint, then leave dead-row cleanup to VACUUM.
Where the WAL lives
The WAL lives in pg_wal under the data directory. It is stored as a sequence of fixed-size segment files, 16 MB each by default. Each segment has a 24-character hexadecimal name:
000000010000000000000023

The name splits into three 8-hex-digit parts: the timeline ID (00000001), the high 32 bits of the segment number (00000000), and the low 32 bits (00000023). A fresh cluster starts on timeline 1; the timeline advances on recovery or promotion.
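The split can be sketched in a few lines of Python (an illustrative helper, not part of Postgres; the field widths follow the description above):

```python
def parse_walfile_name(name: str) -> tuple[int, int, int]:
    """Split a 24-hex-character WAL segment name into its three fields."""
    assert len(name) == 24
    timeline = int(name[0:8], 16)    # timeline ID
    seg_hi   = int(name[8:16], 16)   # high 32 bits of the segment number
    seg_lo   = int(name[16:24], 16)  # low 32 bits of the segment number
    return timeline, seg_hi, seg_lo

print(parse_walfile_name("000000010000000000000023"))  # (1, 0, 35)
```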
The WAL is one continuous byte stream. Any position in that stream can be addressed by a single 64-bit offset, called a log sequence number, or LSN. LSNs are printed as two hexadecimal numbers separated by a slash: the upper and lower 32 bits.
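Converting between the printed form and the underlying 64-bit offset is simple bit arithmetic; a small Python sketch (the helper names are ours, not Postgres's):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'hi/lo' LSN text to a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(pos: int) -> str:
    """Format a 64-bit byte offset the way Postgres prints it."""
    return f"{pos >> 32:X}/{pos & 0xFFFFFFFF:X}"

pos = lsn_to_int("0/4DB1EB70")
print(hex(pos))         # 0x4db1eb70
print(int_to_lsn(pos))  # 0/4DB1EB70
```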
SELECT pg_current_wal_lsn();

--  pg_current_wal_lsn
-- --------------------
--  0/4DB1EB70

An LSN can also be mapped back to the segment file that contains it:
SELECT pg_walfile_name('0/4DB1EB70');

--      pg_walfile_name
-- --------------------------
--  00000001000000000000004D

Older segments are typically recycled rather than deleted. Once their contents are no longer needed for recovery, Postgres renames the file so it can be reused as a future segment. This keeps file allocation off the commit hot path. What exactly counts as “no longer needed” is governed by checkpoints, which we’ll get to later.
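The mapping from LSN to segment file is also plain arithmetic. Here is a Python approximation of pg_walfile_name for the default 16 MB segment size (the real function handles LSNs at exact segment boundaries specially, which this sketch ignores):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024                 # default 16 MB segments
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE   # segments per 4 GB of WAL

def walfile_name(timeline: int, lsn: str) -> str:
    """Approximate pg_walfile_name() for the default segment size."""
    hi, lo = (int(p, 16) for p in lsn.split("/"))
    segno = ((hi << 32) | lo) // WAL_SEG_SIZE   # which segment the LSN falls in
    return (f"{timeline:08X}"
            f"{segno // SEGS_PER_XLOGID:08X}"
            f"{segno % SEGS_PER_XLOGID:08X}")

print(walfile_name(1, "0/4DB1EB70"))  # 00000001000000000000004D
```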
WAL levels
wal_level controls how much information is written into the WAL. There are three values, each a superset of the previous:
- minimal — just enough to recover from a crash. Certain bulk operations can skip WAL for the data they load, since the whole relation could be dropped on rollback anyway. This setting is incompatible with replication: max_wal_senders must be 0, or the server won’t start.
- replica (default) — adds the records a streaming replica needs to apply WAL and an archive needs for point-in-time recovery.
- logical — adds row-level change information for logical decoding, which is what powers logical replication.
Three LSNs: buffer, OS, disk
“The current end of the WAL” is not a single position. WAL records flow through a pipeline with three hand-offs: backend to buffer, buffer to OS, and OS to disk. Postgres exposes one LSN for each stage:
- pg_current_wal_insert_lsn() — the insert position: the byte just past the last record appended to the WAL buffer. This is the logical end of the log.
- pg_current_wal_lsn() — the write position: WAL that has been written from shared WAL buffers to WAL files, but not necessarily forced to durable storage.
- pg_current_wal_flush_lsn() — the flush position: the last byte known to have reached durable storage.
These values satisfy flush ≤ write ≤ insert. On a quiet system they often sit at the same LSN; under load, gaps can open between them.
SELECT pg_current_wal_insert_lsn() AS insert_lsn,
       pg_current_wal_lsn()        AS write_lsn,
       pg_current_wal_flush_lsn()  AS flush_lsn;

--  insert_lsn | write_lsn  | flush_lsn
-- ------------+------------+------------
--  0/4DB1EB70 | 0/4DB1EB70 | 0/4DB1EB70

The write-ahead rule, in LSN terms: before a commit is reported to the client, the flush position must be at or past that commit record’s insert position.
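The invariant is easy to check mechanically. A Python sketch with made-up LSN values standing in for a system under load (the three values here are hypothetical, not measured):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'hi/lo' LSN text to a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Hypothetical readings of the three positions under load:
insert_lsn = lsn_to_int("0/4DB1EB70")
write_lsn  = lsn_to_int("0/4DB1E9F0")
flush_lsn  = lsn_to_int("0/4DB1E800")

# The pipeline invariant: flush <= write <= insert.
assert flush_lsn <= write_lsn <= insert_lsn

# The insert-to-flush gap is WAL that is not yet durable, in bytes:
print(insert_lsn - flush_lsn)  # 880
```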
Checkpoints
The WAL lets Postgres recover by replaying changes after a crash, but that raises a question: replay from where? In principle, the system could start at the beginning of the WAL and apply every record ever written. In practice, that would make recovery slower and slower over the lifetime of the cluster, and it would require keeping old WAL forever.
A checkpoint is Postgres’s way of cutting that history. It creates a point in the WAL from which recovery can safely begin. At checkpoint time, dirty data pages are flushed to disk, and Postgres writes a checkpoint record to the WAL. If the server crashes later, recovery reads the latest checkpoint record, finds its redo pointer, and replays WAL from there.
That gives checkpoints two jobs: they bound crash recovery time, and they let Postgres recycle WAL segments that are no longer needed.
The background checkpointer runs periodically. Two settings control when it fires:
- checkpoint_timeout (default 5 minutes) — maximum wall-clock time between checkpoints.
- max_wal_size (default 1 GB) — a soft limit on WAL growth before an automatic checkpoint is triggered.
Whichever limit is reached first wins. You can also trigger a checkpoint manually:
CHECKPOINT;

A checkpoint does not block writers. It establishes a redo pointer, then spreads the required buffer flushing over a fraction of the checkpoint interval, controlled by checkpoint_completion_target (0.9 by default), so the disk is not saturated all at once.
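In configuration terms, the relevant knobs sit together in postgresql.conf. An illustrative fragment (the values are examples for discussion, not a recommendation):

```
# postgresql.conf -- illustrative values, not a recommendation
checkpoint_timeout = '15min'        # allow longer intervals between checkpoints
max_wal_size = '4GB'                # soft cap on WAL growth between checkpoints
checkpoint_completion_target = 0.9  # spread flushing over 90% of the interval
```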
Once the checkpoint completes, Postgres writes a checkpoint record such as CHECKPOINT_ONLINE. That record points back to the checkpoint’s redo pointer: the place in WAL from which recovery must start. WAL before that redo point is no longer needed for local crash recovery, though replication slots, archive consumers, and wal_keep_size can hold older segments past that point.
There is one consequence of this boundary. Since recovery starts from the checkpoint’s redo pointer, WAL replay must be able to reconstruct any page modified after that point without depending on older WAL records. That is where full-page writes come in.
Full-page writes
WAL records usually describe incremental changes to a page. That works only if recovery starts from a valid copy of the page. Storage, however, typically does not guarantee atomic 8 KB page writes. If the system crashes while a page is being written, the on-disk copy can be torn: part old contents, part new contents. Replaying an incremental WAL record against that torn page could produce garbage.
To avoid that, Postgres logs the entire page the first time it is modified after a checkpoint. This is called a full-page image, or FPI. During recovery, the FPI restores the page to a known-good state, and later WAL records for that page can be replayed on top of it.
The cost is that WAL volume becomes uneven. The first write to a page after a checkpoint may carry a full-page image; subsequent writes to the same page in the same cycle usually do not. Postgres can omit unused space from the middle of the page image, so a sparsely populated page may add only a few hundred bytes, while a densely packed one can be close to 8 KB.
This is why checkpoint frequency affects WAL volume. Shorter checkpoint intervals create more “first writes after a checkpoint,” which means more FPIs for the same workload. Longer intervals reduce that FPI churn, but they also leave more WAL to replay after a crash.
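A toy model makes the trade-off concrete. Assuming one FPI per touched page per checkpoint cycle and roughly 72-byte incremental records (illustrative numbers, not measurements):

```python
def wal_bytes(updates: int, distinct_pages: int, cycles: int,
              fpi_bytes: int = 8192, rec_bytes: int = 72) -> int:
    """Toy model: each checkpoint cycle pays one FPI per touched page
    (assuming every page is re-touched each cycle), and every update
    also writes an incremental record."""
    return cycles * distinct_pages * fpi_bytes + updates * rec_bytes

# Same workload; checkpointing twice as often doubles the FPI cost:
print(wal_bytes(100_000, 1_000, cycles=1))  # 15392000
print(wal_bytes(100_000, 1_000, cycles=2))  # 23584000
```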
Reading the WAL with pg_waldump
pg_waldump prints WAL records in human-readable form. Given a WAL directory and an LSN range, it walks the records in that range and shows the resource manager, record length, transaction ID, LSN, previous LSN, and record-specific details.
This makes it a useful way to see full-page writes directly. The following example creates a small table, forces a checkpoint, then updates two rows on the same heap page:
CREATE TABLE wal_demo (id int PRIMARY KEY, v text);
INSERT INTO wal_demo SELECT i, 'row-' || i FROM generate_series(1, 5) i;
CHECKPOINT;
SELECT pg_current_wal_lsn();  -- 0/4DB4A288
UPDATE wal_demo SET v = 'a' WHERE id = 1;
UPDATE wal_demo SET v = 'b' WHERE id = 2;
SELECT pg_current_wal_lsn();  -- 0/4DB4A4A0

Then dump that WAL range from the shell:
pg_waldump -p pg_wal -s 0/4DB4A288 -e 0/4DB4A4A0

rmgr: XLOG        len (rec/tot):   49/  293, tx:     0, lsn: 0/4DB4A288, prev 0/4DB4A210, desc: FPI_FOR_HINT , blkref #0: rel 1663/16385/36291 blk 0 FPW
rmgr: Heap        len (rec/tot):   72/   72, tx: 10026, lsn: 0/4DB4A3B0, prev 0/4DB4A288, desc: HOT_UPDATE old_xmax: 10026, old_off: 1, old_infobits: [], flags: 0x10, new_xmax: 0, new_off: 6, blkref #0: rel 1663/16385/36291 blk 0
rmgr: Transaction len (rec/tot):   46/   46, tx: 10026, lsn: 0/4DB4A3F8, prev 0/4DB4A3B0, desc: COMMIT 2026-05-06 11:54:50.703982 BST
rmgr: Heap        len (rec/tot):   72/   72, tx: 10027, lsn: 0/4DB4A428, prev 0/4DB4A3F8, desc: HOT_UPDATE old_xmax: 10027, old_off: 2, old_infobits: [], flags: 0x10, new_xmax: 0, new_off: 7, blkref #0: rel 1663/16385/36291 blk 0
rmgr: Transaction len (rec/tot):   46/   46, tx: 10027, lsn: 0/4DB4A470, prev 0/4DB4A428, desc: COMMIT 2026-05-06 11:54:50.704614 BST

The first record is the one to notice:
rmgr: XLOG ... desc: FPI_FOR_HINT ... FPW

FPW marks a full-page write for block 0 of the table. It appears before the first HOT_UPDATE because this is the first time that heap page has been touched since the checkpoint. The record is only 293 bytes here because the page is mostly empty and Postgres can omit the unused hole in the middle of the page image. On a densely packed page, the same kind of record can be close to 8 KB.
The next record is the actual heap update:
rmgr: Heap ... desc: HOT_UPDATE ...

It does not need to carry the whole page, because the full-page image is already earlier in the WAL stream. During recovery, Postgres can restore the page from the FPI first, then apply this incremental update on top.
The second update touches the same heap page:
rmgr: Heap ... desc: HOT_UPDATE ...

Like the first update, it is 72 bytes. Since block 0 already has a full-page image in this checkpoint cycle, Postgres does not need to log the page again for this update. One FPI is enough to give recovery a known-good starting point for later changes to the same page.
A few fields are worth reading as you scan WAL records:
- rmgr is the resource manager, the Postgres subsystem responsible for producing and replaying that record type. Heap covers tuple changes, Btree covers B-tree index changes, Transaction covers commits and aborts, and XLOG covers WAL-level bookkeeping such as checkpoints and full-page images.
- len (rec/tot) shows the size of the record itself and the total size including block data such as full-page images.
- tx is the transaction ID. The FPI_FOR_HINT record isn’t attached to any transaction, so its xid is 0.
- lsn is where the record starts; prev points back at the previous record’s LSN. Readers move forward using record lengths, and the back-pointer lets them verify that the chain of records is intact.
- desc is the record-type-specific payload; blkref lists the data blocks the record touches.
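For ad-hoc analysis, those fields can be sliced out of pg_waldump output with a regular expression. A best-effort Python sketch assuming the line layout shown above (real records vary, so this is not a general parser):

```python
import re

# A pg_waldump output line like the ones above (desc truncated here):
LINE = ("rmgr: Heap        len (rec/tot):   72/   72, tx: 10026, "
        "lsn: 0/4DB4A3B0, prev 0/4DB4A288, desc: HOT_UPDATE ...")

PATTERN = re.compile(
    r"rmgr:\s+(?P<rmgr>\S+)\s+"
    r"len \(rec/tot\):\s+(?P<rec>\d+)/\s*(?P<tot>\d+), "
    r"tx:\s+(?P<tx>\d+), "
    r"lsn: (?P<lsn>[0-9A-F]+/[0-9A-F]+), "
    r"prev (?P<prev>[0-9A-F]+/[0-9A-F]+), "
    r"desc: (?P<desc>.*)"
)

m = PATTERN.match(LINE)
print(m.group("rmgr"), m.group("tx"), m.group("lsn"))  # Heap 10026 0/4DB4A3B0
```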
The pattern is exactly what the checkpoint discussion predicts: after a checkpoint, the first change to a page pays the full-page-write cost; later changes to that same page can stay small and incremental.