Concurrency control in Delta Lake
A Delta table stores rows in Parquet files and tracks table versions in a transaction log under _delta_log/. That split is what makes concurrent writes work. A writer reads a snapshot, stages new data files, validates against any commits that appeared meanwhile, and then publishes the next numbered JSON file under _delta_log/. Until that JSON file exists, the staged Parquet files are just files in the directory. Other readers and writers ignore them.
The protocol in brief
On disk, the table looks like this:
```
table_root/
  part-00000-...-c000.snappy.parquet
  part-00001-...-c000.snappy.parquet
  ...
  _delta_log/
    00000000000000000000.json
    00000000000000000001.json
    ...
    00000000000000000010.checkpoint.parquet
    00000000000000000010.json
    00000000000000000011.json
    _last_checkpoint
```

Each commit creates one new JSON file in _delta_log/. These files are numbered sequentially, one per table version.
A commit file contains a list of actions: small JSON records describing what changed in that version.
- `add` — a new Parquet file becomes part of the table at this version, along with its size, partition values, and column statistics.
- `remove` — a Parquet file is logically deleted from the table as of this version. The file may still exist on disk for time travel until `VACUUM` removes it.
- `commitInfo` — operational metadata for the commit: who ran it, what operation, timestamp.
- `metaData` — schema and table-level configuration. Included whenever it changes.
- `protocol` — the minimum reader and writer versions required for this table.
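The action vocabulary is small enough to parse by hand. A minimal sketch in plain Python — `parse_commit` is a made-up helper for illustration, not part of any Delta library:

```python
import json

def parse_commit(text):
    """Split one commit file (newline-delimited JSON) into its actions."""
    adds, removes, other = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            adds.append(action["add"]["path"])
        elif "remove" in action:
            removes.append(action["remove"]["path"])
        else:
            other.extend(action.keys())  # commitInfo, metaData, protocol, ...
    return adds, removes, other

commit = "\n".join([
    '{"commitInfo":{"operation":"WRITE"}}',
    '{"add":{"path":"part-00001.snappy.parquet","size":812,"dataChange":true}}',
    '{"remove":{"path":"part-00000.snappy.parquet","dataChange":true}}',
])
print(parse_commit(commit))
```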
Replaying every commit from zero to find the current state would scale linearly with table age. To keep that bounded, Delta periodically writes a checkpoint: a Parquet file under _delta_log/ that captures the cumulative state up to some version K as a single set of add, remove, metaData, and protocol records. The _last_checkpoint file at the root of _delta_log/ points at the most recent checkpoint version. Checkpoints land every 10 commits by default.
A reader’s view of the table is just the most recent checkpoint plus the JSON commits after it.
Inspecting the log
Running Spark SQL locally
Install PySpark:
```
uv pip install pyspark==4.1.1
```

Start a Spark SQL shell with:
```
spark-sql --packages io.delta:delta-spark_2.13:4.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```

The session will write to a local directory and create the same _delta_log/ structure any production cluster would.
Create a small table and run a few inserts:
```sql
CREATE TABLE events (id BIGINT, payload STRING) USING DELTA;

INSERT INTO events VALUES (1, 'hello');
INSERT INTO events VALUES (2, 'world');
```

The table directory now has the Parquet data files plus a _delta_log/:
```
$ ls spark-warehouse/events/
_delta_log
part-00000-...-c000.snappy.parquet
part-00001-...-c000.snappy.parquet
```
```
$ ls spark-warehouse/events/_delta_log/
00000000000000000000.json
00000000000000000001.json
00000000000000000002.json
```

Three commits: the CREATE TABLE, then the two INSERTs. Each commit JSON is a list of newline-delimited actions. The most recent one looks like this:
```json
{"commitInfo":{"timestamp":1714477200000,"operation":"WRITE","operationParameters":{"mode":"Append"},"readVersion":1}}
{"add":{"path":"part-00001-...-c000.snappy.parquet","size":812,"partitionValues":{},"modificationTime":1714477200000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":2},\"maxValues\":{\"id\":2}}"}}
```

This commit has two actions: a commitInfo describing the operation and an add for the new Parquet file. The stats field is the per-file column statistics that readers use for data skipping. The first commit (the CREATE TABLE) would also carry metaData and protocol actions setting the schema and the required reader/writer versions.
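Those stats are enough to do data skipping by hand. A sketch for a predicate like `id BETWEEN lo AND hi` against the single-column stats string above — `can_skip` is a hypothetical helper, not a Delta API:

```python
import json

def can_skip(stats_json, lo, hi):
    # A file can be skipped when its [min, max] range for the column
    # does not overlap the predicate's [lo, hi] range.
    stats = json.loads(stats_json)
    return stats["maxValues"]["id"] < lo or stats["minValues"]["id"] > hi

stats = '{"numRecords":1,"minValues":{"id":2},"maxValues":{"id":2}}'
print(can_skip(stats, 10, 20))  # file only holds id=2: skip it
print(can_skip(stats, 1, 5))    # range covers id=2: must read it
```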
Let the table accumulate more commits, and a checkpoint eventually shows up:
```
$ ls spark-warehouse/events/_delta_log/
00000000000000000000.json
...
00000000000000000010.checkpoint.parquet
00000000000000000010.json
00000000000000000011.json
_last_checkpoint
```

Snapshot reads
To resolve the current state of the table, a reader reads _last_checkpoint, loads that checkpoint Parquet file, and replays the add and remove actions from every JSON commit after it. The resulting snapshot is a list of Parquet files plus the schema and metadata in force at that version. The reader uses the per-file statistics in that snapshot to skip files that cannot match the query, then reads the remaining Parquet files.
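The replay itself can be sketched in a few lines of plain Python. This is a toy — `resolve_snapshot` is an invented name, and a real reader parses Parquet checkpoints and handles stats, partition values, and protocol checks:

```python
import json

def resolve_snapshot(checkpoint_live_files, tail_commits):
    # Start from the checkpoint's cumulative file list, then replay the
    # add/remove actions from each commit after it, in version order.
    live = set(checkpoint_live_files)
    for commit_text in tail_commits:
        for line in commit_text.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

checkpoint = {"part-a.parquet", "part-b.parquet"}
tail = [
    '{"add":{"path":"part-c.parquet"}}',
    '{"remove":{"path":"part-a.parquet"}}\n{"add":{"path":"part-d.parquet"}}',
]
print(sorted(resolve_snapshot(checkpoint, tail)))
```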
What matters for concurrency is that the snapshot is immutable for the duration of the query. Once the reader has resolved version 11, any subsequent commits are invisible to it. A writer can append versions 12, 13, 14 mid-scan and the running query doesn’t notice. It’s reading from the file list it already pinned, not from the live log.
The same mechanism also gives Delta time travel. A query like VERSION AS OF 7 resolves the log at an earlier version and scans the files that were live then.
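Time travel falls out of the same replay: resolve the log only up to the requested version. A sketch over the zero-padded commit filenames shown earlier — `commits_as_of` is a made-up name:

```python
def commits_as_of(log_files, version):
    # Keep only the JSON commits at or below the requested version;
    # each filename is the zero-padded version number.
    return sorted(f for f in log_files
                  if f.endswith(".json") and int(f[:-len(".json")]) <= version)

log = ["%020d.json" % v for v in range(12)] + ["_last_checkpoint"]
print(commits_as_of(log, 7)[-1])  # the commit for VERSION AS OF 7
```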
The write protocol
Delta uses optimistic concurrency control: a writer does its work against a fixed snapshot, then checks at commit time whether anything relevant changed while it was running.
There are three phases:
Read. The writer lists _delta_log/ and pins a starting version N. Everything that follows is computed against that snapshot. Read predicates and the set of files consulted are recorded; the validation phase will need them.
Write. The writer does its real work outside the log: new Parquet files for inserts, rewritten files for updates, deletion vectors and change-data files if those are enabled. All of it lands in the table directory. Nothing in the log references any of it yet, so no other writer can see it.
Validate and commit. The writer assembles a list of actions describing what it just did and tries to publish them as _delta_log/<N+1>.json. If that version already exists, another writer committed first. The writer then reads the commits that appeared after version N and compares them with its read set and planned changes. If there is no conflict, it rebases onto the new head and tries again at the next version. If there is a conflict, it aborts with a concurrency exception.
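The three phases hinge on the commit step being atomic. On a local filesystem, `O_CREAT | O_EXCL` plays the role of the store's put-if-absent, which is enough to sketch the commit attempt — `try_commit` is an illustrative helper, not Delta's API:

```python
import json, os, tempfile

def try_commit(log_dir, version, actions):
    path = os.path.join(log_dir, "%020d.json" % version)
    try:
        # O_EXCL makes this an atomic create-if-absent: at most one
        # writer can own a given version number.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Lost the race: validate against the new commits, then retry
        # at the next version.
        return False
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))
    return True

log_dir = tempfile.mkdtemp()
print(try_commit(log_dir, 1, [{"add": {"path": "part-x.parquet"}}]))  # wins
print(try_commit(log_dir, 1, [{"add": {"path": "part-y.parquet"}}]))  # loses, must rebase
```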
What counts as a conflict
The validation phase compares the commits in the range (N, head] against this writer’s read predicates and read-files set. Whether something counts as a conflict depends on what the two operations touched.
Two appenders never conflict with each other. Each added new files; neither touched the other’s files.
An UPDATE and an appender don’t conflict either. The update removed and rewrote some files; the appender added new ones in disjoint locations.
Two UPDATEs on the same file do conflict. The first commit removes file A and adds file A'. The second writer also planned to remove A and add its own version. By the time the second writer validates, A is no longer in the snapshot it read. The file was removed under it. That’s a ConcurrentDeleteReadException.
Other cases get their own named exceptions:
- `ConcurrentDeleteDeleteException` — both commits removed the same file. Two `DELETE`s racing on the same data.
- `ConcurrentAppendException` — a write that registered a partition predicate (e.g. a `MERGE INTO ... WHERE date = '2026-04-29'`) saw new files appear in that partition between read and commit. The new files might match its predicate, so the safe thing is to reject.
- `MetadataChangedException` — a concurrent commit changed the schema or table properties. Any in-flight writer is reading against the wrong metadata and has to start over.
- `ProtocolChangedException` — a concurrent commit bumped `minReaderVersion` or `minWriterVersion`. The current writer might not implement the new protocol; abort.
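A toy version of the validation pass, assuming a transaction records the files it read and the files it plans to remove. The structure and field names here are illustrative, not Delta's internals:

```python
class ConcurrentDeleteReadException(Exception): pass
class ConcurrentDeleteDeleteException(Exception): pass
class ConcurrentAppendException(Exception): pass
class MetadataChangedException(Exception): pass

def validate(txn, winning_commits):
    # Compare each commit in (N, head] against what this writer read
    # and what it plans to change.
    for commit in winning_commits:
        if commit.get("metadata_changed"):
            raise MetadataChangedException()
        removed = set(commit.get("removed", ()))
        if removed & txn["read_files"]:
            raise ConcurrentDeleteReadException()    # a file we read was removed under us
        if removed & txn["planned_removes"]:
            raise ConcurrentDeleteDeleteException()  # both sides deleted the same file
        if txn.get("predicate") and any(txn["predicate"](f) for f in commit.get("added", ())):
            raise ConcurrentAppendException()        # new files may match our read predicate

# The two-UPDATEs-on-one-file race from above: the winner removed file A,
# which this transaction both read and planned to remove.
txn = {"read_files": {"A"}, "planned_removes": {"A"}, "predicate": None}
try:
    validate(txn, [{"removed": ["A"], "added": ["A2"]}])
except ConcurrentDeleteReadException:
    print("ConcurrentDeleteReadException")
```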
Isolation levels
The validation rules above implement WriteSerializable isolation, which is the default. Delta supports three isolation levels:
- Serializable — the strongest. The committed schedule of reads and writes is equivalent to some serial order, and that order matches the order of versions in the log.
- WriteSerializable — the default. Writes are serializable, but the serial order doesn’t have to match the log’s version order. A commit is allowed if the result could have come from running the operations in some serial order, even one that disagrees with the order they committed in.
- SnapshotIsolation — used internally for commits that don’t change data, like no-op metadata operations. The validation phase doesn’t add files to the conflict check.
The gap between the first two is subtle and easiest to see by example. A long-running DELETE FROM t WHERE category = 'archive' and a quick INSERT INTO t VALUES (...) start at the same time. The insert commits first as version 51. The delete commits a moment later as version 52.
Under Serializable, the delete has to be consistent with the world after the insert. If the inserted row matched category = 'archive', the delete should have removed it. It didn’t, because the delete pinned its snapshot before the insert committed. Serializable rejects this and the delete has to retry.
Under WriteSerializable, the engine asks whether any serial order matches the final result. The order “delete first, then insert” can: the delete removes the rows that matched in its snapshot, and then the insert appends a new row. The log says the insert committed before the delete, but the final state is still consistent with a valid serial order, so the delete can commit.
WriteSerializable is the default because the cases where the distinction matters are narrow and the cost of rejecting them is high. Set the table property delta.isolationLevel = 'Serializable' if you need the stronger guarantee. Reads run at snapshot isolation regardless of the writer’s level.
Mutual exclusion at commit time
The whole optimistic protocol depends on a single primitive: exactly one writer can successfully create _delta_log/<N+1>.json. If two writers race and both succeed, the log forks and the table is corrupted.
On stores with atomic create-if-absent semantics — HDFS and Azure Data Lake Storage via rename, GCS and Azure Blob via put-if-absent — this is free. The filesystem rejects the loser’s create call and the optimistic protocol’s retry loop kicks in.
S3 was the awkward one. Until recently, PutObject had no conditional-create primitive: two clients writing the same key would each succeed and the last write would silently win. Delta worked around this with an external commit coordinator — a DynamoDB table — that brokered the race. Each writer claimed the next version in DynamoDB first; only the winner was allowed to write the JSON file. The filesystem write still happened, but the mutual exclusion came from the database, not from S3.
S3 has since added the If-None-Match precondition on PutObject, which gives Delta the same primitive it gets on other stores. The DynamoDB log store is no longer required for new S3 writers.
Further reading
The Delta Lake protocol now supports catalog-managed tables. This changes the read and write protocols, since responsibility for commit atomicity has shifted from the filesystem to the catalog.