Inside PostgreSQL Query Execution: The Volcano Model
Imagine a production line where nothing moves until the end of the line asks for it. The final station says “I need the next item,” and that request ripples backward. Each station asks the one before it, all the way back to the raw materials. Parts only get pulled through the line when someone downstream actually needs them.
That’s more or less how PostgreSQL executes your queries. By the time a query reaches the executor, it’s already been through several stages: the SQL text has been parsed into a tree, views have been expanded, and the planner has figured out the most efficient way to actually get the data. The planner outputs a tree of plan nodes, each representing a specific operation like scanning, filtering, sorting, or joining. Other database systems sometimes call these operators. Think of each node as a station on our production line.
The Volcano model
The executor runs this tree using the Volcano processing model, a demand-driven approach used by PostgreSQL (and many other databases). Execution starts at the root of the plan tree (the final station), which requests a tuple from its child. This request cascades down to a leaf node (typically a table or index scan, which fetches the actual data). The tuple then flows back up the tree, each node processing it along the way, until it reaches the root and gets returned to the client. Because this happens one tuple at a time, it’s known as tuple-at-a-time processing.
This works because every node follows a standard three-method protocol: open() to set up the node (allocate memory, initialize state, and recursively open any child nodes it depends on), next() to produce and return the next tuple when asked (calling next() on children if it needs input), and close() to clean up resources when processing is complete.
This interface makes nodes composable. Any node can consume tuples from any other node as long as they adhere to the open/next/close interface. This allows complex plans to be built from simple, reusable blocks.
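To make the protocol concrete, here is a minimal Python sketch (illustrative only, not PostgreSQL code; the class and variable names are invented for this example). It shows two nodes following the open/next/close protocol, with a filter node stacked on top of a scan node:

```python
# Toy Volcano-style nodes: each one exposes open(), next(), close().
class SeqScan:
    """Leaf node: produces tuples from an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Filter:
    """Inner node: pulls from any child that follows the same protocol."""
    def __init__(self, predicate, child):
        self.predicate = predicate
        self.child = child

    def open(self):
        self.child.open()  # recursively open children

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row  # pass matching tuples up; None signals end

    def close(self):
        self.child.close()


# Compose a tiny plan: Filter(even) on top of SeqScan.
plan = Filter(lambda r: r % 2 == 0, SeqScan([1, 2, 3, 4, 5]))
plan.open()
out = []
while (t := plan.next()) is not None:
    out.append(t)
plan.close()
print(out)  # [2, 4]
```

Because `Filter` only assumes its child implements the same three methods, any node could be swapped in as its input, which is exactly the composability the interface provides.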
When the line stalls: pipeline breakers
Most of the time, tuples flow smoothly through the tree. A filter node checks a row and passes it along; a nested loop join emits a combined row the exact moment it finds a match. This is called pipelining.
But some operations can’t produce any output until they’ve consumed all of their input. These are pipeline breakers, and they’re the equivalent of a station that needs to receive every item before it can start sorting or assembling them.
Common examples include sorting (you must see all rows to know which comes first) and the build phase of a hash join (the complete hash table must be built from one input before probing it with the other).
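The sorting case can be sketched in the same toy Python protocol as before (illustrative only, not PostgreSQL code; `ListScan` and `Sort` are invented names). The key point is that the first `next()` call on the sort node drains its entire child before returning a single tuple:

```python
class ListScan:
    """Minimal leaf node producing tuples from a Python list."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Sort:
    """Pipeline breaker: the first next() call drains the whole child."""
    def __init__(self, child):
        self.child = child

    def open(self):
        self.child.open()
        self.sorted_rows = None

    def next(self):
        if self.sorted_rows is None:
            # Consume the ENTIRE input before emitting a single tuple.
            rows = []
            while (t := self.child.next()) is not None:
                rows.append(t)
            self.sorted_rows = sorted(rows)
            self.pos = 0
        if self.pos >= len(self.sorted_rows):
            return None
        row = self.sorted_rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.child.close()


plan = Sort(ListScan([3, 1, 2]))
plan.open()
first = plan.next()   # this single call consumed all three input rows
rest = [plan.next(), plan.next(), plan.next()]
plan.close()
print(first, rest)  # 1 [2, 3, None]
```

All the work happens inside the first `next()` call; subsequent calls just hand out rows from the already-sorted buffer, which is exactly the latency pattern the cursor example below demonstrates.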
This video, part of a lecture series by Universität Tübingen, is a good resource on the Volcano model and demonstrates the impact of pipeline breakers.
In short, if you create a cursor for a query that includes a pipeline breaker (like a sort node), you will notice a delay when fetching the first record that depends on it, as the operation first needs to consume all of its input data. For example:
```sql
begin;

declare demo cursor for
  -- These two rows are available immediately
  select i from generate_series(1, 2) as i
  union all
  -- This subquery needs sorting, acting as a pipeline breaker
  (select i from generate_series(5000000, 3, -1) as i order by i limit 2);

-- Fetching the first two tuples (from the first part of UNION ALL) is fast
fetch next demo; -- 4.558 ms
fetch next demo; -- 7.095 ms

-- Fetching the third tuple requires the sort to complete processing
-- its entire input (5 million rows) before emitting its smallest row.
-- This will take significantly longer.
fetch next demo; -- 697.278 ms

-- Fetching the fourth tuple (the second from the sorted subquery) is fast again,
-- as the sort has already completed.
fetch next demo; -- 8.205 ms

rollback;
```

Enable timing in psql with `\timing` to see the execution durations.
Volcano execution model in PostgreSQL
Now, let’s take a look at how the Volcano model is implemented in Postgres. The execution of a query in Postgres follows this pattern:
- `ExecutorStart`: This function initializes the query execution. It calls `ExecInitNode` recursively on the plan tree, starting from the root. `ExecInitNode` performs the setup for each node (analogous to `open()`).
- `ExecutorRun`: This function drives the execution. It enters a loop that repeatedly calls `ExecProcNode` on the top-level plan node to fetch result tuples (analogous to calling `next()` on the root). It continues until the top node signals that no more tuples are available.
- `ExecutorEnd`: This function cleans up after execution is complete. It calls `ExecEndNode` recursively on the tree to release resources, corresponding to the `close()` operation.
For a more comprehensive overview, see the executor README.
The key function that ties it all together is ExecProcNode. It’s the mechanism by which any node asks its child for the next tuple — the practical equivalent of calling next():
```c
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL)
        ExecReScan(node);

    /* Call the node-specific processing function via function pointer */
    return node->ExecProcNode(node);
}
```

Each node type assigns its own function pointer to that `ExecProcNode` field, so calling it on a SeqScan node invokes `ExecSeqScan`, calling it on a Sort node invokes `ExecSort`, and so on. The functions handling these node executions reside in the `src/backend/executor` directory.
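The dispatch idea can be illustrated with a hypothetical Python analogy (not actual PostgreSQL code; the function and dictionary names here are invented): each node registers its own processing function once at initialization, so the generic entry point contains no switch over node types.

```python
# Per-node processing functions, analogous to ExecSeqScan, ExecSort, etc.
def exec_seq_scan(node):
    return "ExecSeqScan: next tuple from " + node["relation"]

def exec_sort(node):
    return "ExecSort: next tuple in sorted order"

def exec_proc_node(node):
    # Generic dispatcher: call whatever the node registered at init time.
    return node["exec_proc"](node)

# Each "node" stores its own callable, like the ExecProcNode field pointer.
seq_scan_node = {"relation": "demo", "exec_proc": exec_seq_scan}
sort_node = {"exec_proc": exec_sort}

print(exec_proc_node(seq_scan_node))  # ExecSeqScan: next tuple from demo
print(exec_proc_node(sort_node))      # ExecSort: next tuple in sorted order
```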
Let’s make this concrete. Given:
```sql
select * from demo limit 2;
```

`ExecutorRun` calls `ExecProcNode` on the root node (Limit). That invokes `ExecLimit`, which calls `ExecProcNode` on its child (SeqScan), which invokes `ExecSeqScan` to fetch a tuple. The tuple flows back up. Repeat until the limit is satisfied.
Tracing execution with GDB
Installing Postgres from source
To trace execution, it’s best to compile PostgreSQL from source with debug symbols enabled and optimizations disabled. The official documentation provides detailed instructions. Here, we’ll provide a brief outline of the steps. For an automated approach using Terraform to deploy and install Postgres on a VM in Google Cloud, you can refer to this Gist.
Steps
- Update package lists and install necessary build dependencies (run as root or use sudo):

```shell
apt-get update && apt-get -y install \
  build-essential \
  libreadline-dev \
  zlib1g-dev \
  flex \
  bison \
  libxml2-dev \
  libxslt-dev \
  libssl-dev \
  wget \
  gdb
```

- Download and extract the PostgreSQL source code (replace version number if needed):

```shell
wget https://ftp.postgresql.org/pub/source/v16.3/postgresql-16.3.tar.gz
tar -xf postgresql-16.3.tar.gz
cd postgresql-16.3
```

- Configure the build, compile, and install. The `--enable-debug` option adds debug symbols, and `CFLAGS='-O0'` disables compiler optimizations. For more details on the configure step: https://www.postgresql.org/docs/current/install-make.html#CONFIGURE-OPTIONS

```shell
./configure --enable-debug --without-icu CFLAGS='-O0'
make -j$(nproc)
make install
```

- Set up the data directory:

```shell
mkdir -p /usr/local/pgsql/data
```

- Create a new user, initialize the data directory, and start the PostgreSQL cluster:

```shell
useradd --system postgres
chown -R postgres:postgres /usr/local/pgsql/data
sudo -u postgres /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start
```

Test the installation by connecting with psql:

```shell
/usr/local/pgsql/bin/psql -U postgres
# You should get the psql prompt: postgres=#
# Type \q to exit
```

To better understand how the Volcano model works in practice, we can use GNU Debugger (GDB) to trace function calls during query execution. This approach allows us to see the actual control flow as a query is processed.
The following script sets breakpoints on key executor functions and logs each call with indentation to show the call stack depth:
```python
python

import gdb

class LogFunctionBreakPoint(gdb.Breakpoint):
    def __init__(self, func_name):
        super().__init__(func_name)
        self.func_name = func_name
        self.silent = True

    def stop(self):
        frame = gdb.selected_frame()
        depth = 0
        while frame:
            frame = frame.older()
            if frame:
                depth += 1

        indent = ' ' * depth
        print(f"{indent} -> {self.func_name}")

        return False

functions_to_trace = ['ExecutorRun', 'ExecProcNode', 'MultiExecProcNode',
                      'ExecSort', 'ExecHashJoin', 'ExecSeqScan',
                      'MultiExecHash', 'ExecLimit']

for fn in functions_to_trace:
    LogFunctionBreakPoint(fn)
end
```

Now, let’s connect to Postgres via psql and create some demo data:
```sql
create table demo (a int);
insert into demo select s from generate_series(1, 3) s;
```

Find the backend PID and attach GDB:

```sql
select pg_backend_pid();
```

```shell
sudo gdb -p 65040 -x script.gdb
```

Tracing a simple query
Let’s execute our simple LIMIT query again:
```sql
select * from demo limit 2;
```

Here’s what the trace looks like:

```
 -> ExecutorRun
  -> ExecProcNode
   -> ExecLimit
    -> ExecProcNode
     -> ExecSeqScan
  -> ExecProcNode
   -> ExecLimit
    -> ExecProcNode
     -> ExecSeqScan
  -> ExecProcNode
   -> ExecLimit
```

You can see the pull-based flow in action. `ExecutorRun` asks the root (Limit) for a tuple. Limit asks its child (SeqScan). SeqScan delivers a row. That’s one cycle.
The pattern repeats for the second tuple. On the third call, ExecLimit knows it’s already returned 2 rows, so it signals completion without asking SeqScan for anything.
Tracing a more complex query
Now let’s try a join with sorting:
```sql
-- Force specific join types for demonstration
SET enable_nestloop = off;
SET enable_mergejoin = off;
SET enable_bitmapscan = off;

select d1.a
from demo d1
join demo d2 on d1.a = d2.a
order by d1.a desc;
```

The trace gets more interesting:

```
 -> ExecutorRun
  -> ExecProcNode
   -> ExecSort
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
      -> MultiExecProcNode
       -> MultiExecHash
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
  -> ExecProcNode
   -> ExecSort
  -> ExecProcNode
   -> ExecSort
  -> ExecProcNode
   -> ExecSort
```

This is where the pipeline breaker shows up clearly. When `ExecutorRun` first asks Sort for a tuple, Sort can’t answer yet — it needs all the data first. So it pulls repeatedly from HashJoin, which in turn pulls from its children (building the hash table from one SeqScan, then probing it with rows from the other). Only after HashJoin is exhausted does Sort have everything it needs. From that point on, the subsequent calls to `ExecSort` return immediately — the sorted results are already in memory, ready to be handed out one at a time.
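The build/probe asymmetry visible in the trace can be sketched in a few lines of plain Python (illustrative only, not PostgreSQL internals; `hash_join` and `key` are invented names for this example). The inner input must be fully consumed into a hash table before any probe can produce output, while the probe phase itself streams one row at a time:

```python
def hash_join(outer, inner, key):
    """Toy hash join: blocking build phase, then streaming probe phase."""
    # Build phase (pipeline breaker): drain the ENTIRE inner input first.
    table = {}
    for row in inner:
        table.setdefault(key(row), []).append(row)
    # Probe phase (pipelined): each outer row can be matched immediately.
    for row in outer:
        for match in table.get(key(row), []):
            yield (row, match)

pairs = list(hash_join([1, 2, 3], [2, 3, 4], key=lambda x: x))
print(pairs)  # [(2, 2), (3, 3)]
```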
Wrapping up
The Volcano model gives PostgreSQL a clean, composable way to execute queries: every node follows the same pull-based protocol, and complex plans are just trees of these interchangeable building blocks. That said, the tuple-at-a-time approach does come with overhead — all those function calls add up. This has led analytical databases to explore alternatives like vectorized execution, which processes data in batches rather than individual tuples.
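As a rough illustration of that batching idea (toy code, not taken from any real engine; `BatchScan` and `next_batch` are invented names), a vectorized node returns a chunk of rows per call, amortizing the per-call overhead that tuple-at-a-time execution pays for every single row:

```python
class BatchScan:
    """Toy vectorized scan: next_batch() returns many rows per call."""
    def __init__(self, rows, batch_size=1024):
        self.rows = rows
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        if self.pos >= len(self.rows):
            return None  # input exhausted
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch

# Ten rows flow through in three calls instead of ten.
scan = BatchScan(list(range(10)), batch_size=4)
batches = []
while (b := scan.next_batch()) is not None:
    batches.append(b)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```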