Inside Postgres query execution: the Volcano model
Imagine a production line that only runs on demand. The station at the end asks for the next item, that request gets passed backward through the line, and each station pulls from the one before it. Nothing gets processed until a later station asks for it.
Query execution in Postgres works in much the same way. The executor runs a tree of plan nodes: operations like scans, filters, joins, and sorts. You can think of each node as a station in that line.
The Volcano model
The executor runs this tree using the Volcano processing model, a demand-driven execution model used by many databases. Execution starts at the root of the plan tree, which requests a tuple from its child. This request cascades down to a leaf node, typically a table or index scan, where data is fetched. The tuple then flows back up the tree, each node processing it along the way, until it reaches the root and gets returned to the client. Because tuples are pulled through the plan one at a time, this is often described as tuple-at-a-time execution.
This works because every node follows the same protocol: open() to initialize state, next() to produce the next tuple, and close() to clean up. A node can call next() on its children whenever it needs more input.
This interface makes nodes composable. A complex plan is a tree of these nodes wired together, each consuming tuples from its children and passing results upward.
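To make the protocol concrete, here is a minimal sketch in Python (an illustrative analogy only; the class and method names are ours, not Postgres APIs):

```python
# Toy Volcano-style nodes: every node implements the same
# open()/next()/close() protocol, so arbitrary trees compose.
class SeqScan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        # Produce one tuple per call; None signals end of input.
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        # Pull from the child until a tuple passes the predicate.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()

# Compose: a Filter on top of a SeqScan, pulled one tuple at a time.
plan = Filter(SeqScan([1, 2, 3, 4]), lambda r: r % 2 == 0)
plan.open()
print(plan.next())  # 2
print(plan.next())  # 4
print(plan.next())  # None
plan.close()
```

Because the Filter only knows its child implements `next()`, the same Filter works on top of any other node, which is exactly what makes plan trees composable.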
Pipeline breakers
Most of the time, tuples flow smoothly through the tree. A filter node checks a row and passes it along; a nested loop join can emit a joined row as soon as it finds a match. This is called pipelining.
But some operations can’t produce any output until they’ve consumed all of their input. These are called pipeline breakers. Common examples include sorting, which must see all rows before it can determine their order, and the build phase of a hash join, which must construct the entire hash table from one input before probing it with the other. We’ll see the impact of these pipeline breakers later.
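A pipeline breaker can be sketched in the same toy Python style (illustrative only, not Postgres code): a Sort node's first `next()` call must drain its entire child before it can emit anything.

```python
# Toy scan node: emits one tuple per next() call, None at end of input.
class SeqScan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass

# Sort is a pipeline breaker: it cannot know which tuple is smallest
# until it has seen every input tuple.
class Sort:
    def __init__(self, child):
        self.child = child

    def open(self):
        self.child.open()
        self.buffer = None

    def next(self):
        if self.buffer is None:
            # First call: consume ALL input before producing anything.
            rows = []
            while (row := self.child.next()) is not None:
                rows.append(row)
            self.buffer = sorted(rows)
            self.pos = 0
        if self.pos >= len(self.buffer):
            return None
        row = self.buffer[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.child.close()

plan = Sort(SeqScan([3, 1, 2]))
plan.open()
print(plan.next())  # 1 -- but only after the whole scan was consumed
print(plan.next())  # 2
plan.close()
```

All of the work happens inside the first `next()` call; subsequent calls just step through the buffered, sorted result.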
This video, part of a lecture series by Universität Tübingen, is a good introduction to the Volcano model and nicely illustrates the impact of pipeline breakers.
In short, you can see the impact when fetching from a cursor. If the query includes a pipeline breaker such as a sort node, fetching the first row may take noticeably longer because the executor must first consume all of the input before it can produce any output. For example:
BEGIN;
DECLARE demo CURSOR FOR
  -- These two rows are available immediately
  SELECT i FROM generate_series(1, 2) AS i
  UNION ALL
  -- This subquery needs sorting, acting as a pipeline breaker
  (
    SELECT i FROM generate_series(5000000, 3, -1) AS i
    ORDER BY i
    LIMIT 2
  );

-- Fetching the first two tuples (from the first part of UNION ALL) is fast
FETCH NEXT demo; -- 4.558 ms
FETCH NEXT demo; -- 7.095 ms

-- Fetching the third tuple requires the sort to complete processing
-- before emitting its smallest row. This will take significantly longer.
FETCH NEXT demo; -- 697.278 ms

-- Fetching the fourth tuple (the second from the sorted subquery) is fast again,
-- as the sort has already completed.
FETCH NEXT demo; -- 8.205 ms

ROLLBACK;

Enable timing in psql with \timing to see the execution durations.
How Postgres implements it
Query execution follows the same pattern:
- ExecutorStart initializes execution. Starting at the root of the plan tree, it calls ExecInitNode (roughly analogous to open()) recursively to set up each node.
- ExecutorRun drives execution. It repeatedly calls ExecProcNode on the root node to fetch result tuples. It continues until the root node signals that no more tuples are available.
- ExecutorEnd cleans up once execution is finished. It calls ExecEndNode recursively to release resources, corresponding to close().
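The three phases amount to a simple pull loop. Here is an illustrative Python sketch of that lifecycle (an analogy for the C functions, not the actual Postgres API; the names are ours):

```python
# A trivial scan node stands in for the plan tree.
class Scan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):               # ~ ExecInitNode
        self.pos = 0

    def next(self):               # ~ ExecProcNode
        if self.pos >= len(self.rows):
            return None           # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):              # ~ ExecEndNode
        pass

def run_query(root):
    results = []
    root.open()                   # ~ ExecutorStart: initialize the tree
    try:
        # ~ ExecutorRun: pull from the root until it signals exhaustion
        while (tup := root.next()) is not None:
            results.append(tup)
    finally:
        root.close()              # ~ ExecutorEnd: release resources
    return results

print(run_query(Scan([1, 2, 3])))  # [1, 2, 3]
```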
For a more comprehensive overview, see the executor README.
The function that ties this all together is ExecProcNode. It’s how any node asks its child for the next tuple, the equivalent of calling next():
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL)
        ExecReScan(node);

    // Call the node-specific processing function via function pointer
    return node->ExecProcNode(node);
}

Each node type assigns its own function pointer to the ExecProcNode field. For example, calling it on a sequential scan node invokes ExecSeqScan; on a sort node, ExecSort; and so on. The functions handling these node executions live in the src/backend/executor directory.
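The dispatch mechanism can be mimicked in a few lines of Python (an illustrative analogy; the functions and fields here are invented for the example, not Postgres code):

```python
# Each "node" carries its own processing function in a common field,
# mirroring how PlanState stores a per-node-type function pointer.
def exec_seq_scan(node):
    return f"tuple from seq scan on {node['relation']}"

def exec_sort(node):
    return "tuple from sort"

def exec_proc_node(node):
    # Same call site for every node type; the behavior comes entirely
    # from the function stored on the node.
    return node["exec_proc"](node)

scan_node = {"exec_proc": exec_seq_scan, "relation": "demo"}
sort_node = {"exec_proc": exec_sort}

print(exec_proc_node(scan_node))  # tuple from seq scan on demo
print(exec_proc_node(sort_node))  # tuple from sort
```

This indirection is why the executor can drive any plan tree with one uniform call, while each node type keeps its own implementation.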
Tracing execution with GDB
Installing Postgres from source
To trace execution, it’s best to compile Postgres from source with debug symbols enabled and optimizations disabled. The official documentation provides detailed instructions. Here, we’ll provide a brief outline of the steps.
Steps
- Update package lists and install necessary build dependencies:
sudo apt-get update && sudo apt-get -y install \
  build-essential \
  libreadline-dev \
  zlib1g-dev \
  flex \
  bison \
  libxml2-dev \
  libxslt-dev \
  libssl-dev \
  wget \
  gdb

- Download and extract the Postgres source code (replace version number if needed):

wget https://ftp.postgresql.org/pub/source/v18.0/postgresql-18.0.tar.gz
tar -xf postgresql-18.0.tar.gz
cd postgresql-18.0

- Configure the build, compile, and install. The --enable-debug option adds debug symbols, and CFLAGS='-O0' disables compiler optimizations. For more details on the configure step: https://www.postgresql.org/docs/current/install-make.html#CONFIGURE-OPTIONS

./configure --enable-debug --without-icu CFLAGS='-O0'
make -j$(nproc)
sudo make install

- Set up the data directory:

sudo mkdir -p /usr/local/pgsql/data

- Create a new user, initialize the data directory, and start the Postgres cluster:

sudo useradd --system --create-home postgres
sudo chown postgres:postgres /usr/local/pgsql/data
sudo -iu postgres

# now as postgres:
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start

Test the installation by connecting with psql:

/usr/local/pgsql/bin/psql -U postgres
# You should get the psql prompt: postgres=#
# Type \q to exit

We can use GDB to trace function calls during query execution and see the control flow. The following script sets breakpoints on selected executor functions and logs each call with indentation to show the call stack depth:
python
import gdb
class LogFunctionBreakPoint(gdb.Breakpoint):
    def __init__(self, func_name):
        super().__init__(func_name)
        self.func_name = func_name
        self.silent = True

    def stop(self):
        frame = gdb.selected_frame()
        depth = 0
        while frame:
            frame = frame.older()
            if frame:
                depth += 1

        indent = ' ' * depth
        print(f"{indent} -> {self.func_name}")

        return False

functions_to_trace = ['ExecutorRun', 'ExecProcNode', 'MultiExecProcNode',
                      'ExecSort', 'ExecHashJoin', 'ExecSeqScan',
                      'MultiExecHash', 'ExecLimit']

for fn in functions_to_trace:
    LogFunctionBreakPoint(fn)

Now, let's connect to Postgres via psql and create some demo data:
CREATE TABLE demo (a int);
INSERT INTO demo
SELECT s FROM generate_series(1, 3) s;

Find the backend PID and attach GDB:

SELECT pg_backend_pid();

sudo gdb -p 65040 -x script.gdb

Tracing a simple query
Let’s execute a simple LIMIT query:
SELECT *
FROM demo
LIMIT 2;

The trace looks like this:
-> ExecutorRun
  -> ExecProcNode
    -> ExecLimit
      -> ExecProcNode
        -> ExecSeqScan
  -> ExecProcNode
    -> ExecLimit
      -> ExecProcNode
        -> ExecSeqScan
  -> ExecProcNode
    -> ExecLimit

You can see the pull-based flow in action. ExecutorRun asks the root (Limit) for a tuple. Limit asks its child (SeqScan). SeqScan returns a row. That's one cycle.
The pattern repeats for the second tuple. On the third call, ExecLimit knows it’s already returned 2 rows, so it signals completion without asking SeqScan for anything.
Tracing a more complex query
Now let’s try a join with sorting:
-- Force a hash join
SET enable_nestloop = off;
SET enable_mergejoin = off;

SELECT d1.a
FROM demo d1
JOIN demo d2 ON d1.a = d2.a
ORDER BY d1.a DESC;

The trace is more interesting:
-> ExecutorRun
  -> ExecProcNode
    -> ExecSort
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
          -> MultiExecProcNode
            -> MultiExecHash
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
  -> ExecProcNode
    -> ExecSort
  -> ExecProcNode
    -> ExecSort
  -> ExecProcNode
    -> ExecSort

This is where the pipeline breaker shows up clearly. When ExecutorRun first asks Sort for a tuple, Sort can't return one yet because it needs all of its input before it can determine the final ordering. So, it initiates a cascade of pulls from the HashJoin tree below it.
The HashJoin node pulls from its child sequential scans, builds its hash table, and finds matches. The Sort node must pull from this join until all matched rows have been consumed. Only once the HashJoin is exhausted can Sort begin returning results to the top level. From that point on, each call to ExecSort returns the next row from the already-sorted set.
The trade-off
The cost of this design is per-tuple overhead. Look at the trace for the join query: three rows in each table, and the executor made dozens of function calls to produce the result. Each tuple passes through ExecProcNode, into the node-specific function, and back.
The upside is composability. Every node follows the same pull-based protocol, so adding a new node type is a matter of implementing open(), next(), and close() and letting it slot into any plan tree. If you're curious how the per-tuple overhead can be mitigated, take a look at vectorized execution.
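To illustrate the idea (a toy Python sketch, not a real engine's API), a vectorized scan returns a batch of tuples per call instead of one, amortizing the per-call overhead across many rows:

```python
# Batch-at-a-time scan: one call yields up to batch_size tuples,
# so the interpretation overhead per call is spread over a whole batch.
class BatchScan:
    def __init__(self, rows, batch_size=1024):
        self.rows, self.batch_size = rows, batch_size
        self.pos = 0

    def next_batch(self):
        # None signals end of input, mirroring next() in the tuple model.
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch

scan = BatchScan([1, 2, 3, 4, 5], batch_size=2)
print(scan.next_batch())  # [1, 2]
print(scan.next_batch())  # [3, 4]
print(scan.next_batch())  # [5]
print(scan.next_batch())  # None
```

With three rows per table, the difference is invisible; with millions of rows, turning per-tuple calls into per-batch calls is what makes this approach attractive.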