Inside Postgres query execution: the Volcano model
Imagine a production line that only runs on demand. The station at the end asks for the next item, that request gets passed backward through the line, and each station pulls from the one before it. Nothing gets processed until a later station asks for it.
Query execution in Postgres works in much the same way. The executor runs a tree of plan nodes: operations like scans, filters, joins, and sorts. You can think of each node as a station in that line.
The Volcano model
The executor runs this tree using the Volcano processing model, a demand-driven execution model used by many databases. Execution starts at the root of the plan tree, which requests a tuple from its child. This request cascades down to a leaf node, typically a table or index scan, where data is fetched. The tuple then flows back up the tree, each node processing it along the way, until it reaches the root and gets returned to the client. Because tuples are pulled through the plan one at a time, this is often described as tuple-at-a-time execution.
This works because every node follows the same protocol: open() to initialize state, next() to produce the next tuple, and close() to clean up. A node can call next() on its children whenever it needs more input.
This interface makes nodes composable. A complex plan is a tree of these nodes wired together, each consuming tuples from its children and passing results upward.
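To make the protocol concrete, here is a minimal sketch in Python (an illustrative analogy only; the class and method names are ours, not Postgres APIs):

```python
# Toy Volcano-style nodes: every node implements the same
# open()/next()/close() protocol, so arbitrary trees compose.
class SeqScan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        # Produce one tuple per call; None signals end of input.
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        # Pull from the child until a tuple passes the predicate.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()

# Compose: a Filter on top of a SeqScan, pulled one tuple at a time.
plan = Filter(SeqScan([1, 2, 3, 4]), lambda r: r % 2 == 0)
plan.open()
print(plan.next())  # 2
print(plan.next())  # 4
print(plan.next())  # None
plan.close()
```

Because the Filter only knows its child implements `next()`, the same Filter works on top of any other node, which is exactly what makes plan trees composable.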
Pipeline breakers
Most of the time, tuples flow smoothly through the tree. A filter node checks a row and passes it along; a nested loop join can emit a joined row as soon as it finds a match. This is called pipelining.
But some operations can’t produce any output until they’ve consumed all of their input. These are called pipeline breakers. Common examples include sorting, which must see all rows before it can determine their order, and the build phase of a hash join, which must construct the entire hash table from one input before probing it with the other. We’ll see the impact of these pipeline breakers later.
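A pipeline breaker can be sketched in the same toy Python style (illustrative only, not Postgres code): a Sort node's first `next()` call must drain its entire child before it can emit anything.

```python
# Toy scan node: emits one tuple per next() call, None at end of input.
class SeqScan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass

# Sort is a pipeline breaker: it cannot know which tuple is smallest
# until it has seen every input tuple.
class Sort:
    def __init__(self, child):
        self.child = child

    def open(self):
        self.child.open()
        self.buffer = None

    def next(self):
        if self.buffer is None:
            # First call: consume ALL input before producing anything.
            rows = []
            while (row := self.child.next()) is not None:
                rows.append(row)
            self.buffer = sorted(rows)
            self.pos = 0
        if self.pos >= len(self.buffer):
            return None
        row = self.buffer[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.child.close()

plan = Sort(SeqScan([3, 1, 2]))
plan.open()
print(plan.next())  # 1 -- but only after the whole scan was consumed
print(plan.next())  # 2
plan.close()
```

All of the work happens inside the first `next()` call; subsequent calls just step through the buffered, sorted result.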
This video, part of a lecture series by Universität Tübingen, is a good introduction to the Volcano model and nicely illustrates the impact of pipeline breakers.
In short, you can see the impact when fetching from a cursor. If the query includes a pipeline breaker such as a sort node, fetching the first row may take noticeably longer because the executor must first consume all of the input before it can produce any output. For example:
BEGIN;
DECLARE demo CURSOR FOR
  -- These two rows are available immediately
  SELECT i FROM generate_series(1, 2) AS i
  UNION ALL
  -- This subquery needs sorting, acting as a pipeline breaker
  (
    SELECT i FROM generate_series(5000000, 3, -1) AS i
    ORDER BY i
    LIMIT 2
  );

-- Fetching the first two tuples (from the first part of UNION ALL) is fast
FETCH NEXT demo; -- 4.558 ms
FETCH NEXT demo; -- 7.095 ms

-- Fetching the third tuple requires the sort to complete processing
-- before emitting its smallest row. This will take significantly longer.
FETCH NEXT demo; -- 697.278 ms

-- Fetching the fourth tuple (the second from the sorted subquery) is fast again,
-- as the sort has already completed.
FETCH NEXT demo; -- 8.205 ms

ROLLBACK;

Enable timing in psql with \timing to see the execution durations.
How Postgres implements it
Query execution follows the same pattern:
- ExecutorStart initializes execution. Starting at the root of the plan tree, it calls ExecInitNode (roughly analogous to open()) recursively to set up each node.
- ExecutorRun drives execution. It repeatedly calls ExecProcNode on the root node to fetch result tuples. It continues until the root node signals that no more tuples are available.
- ExecutorEnd cleans up once execution is finished. It calls ExecEndNode recursively to release resources, corresponding to close().
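The three phases amount to a simple pull loop. Here is an illustrative Python sketch of that lifecycle (an analogy for the C functions, not the actual Postgres API; the names are ours):

```python
# A trivial scan node stands in for the plan tree.
class Scan:
    def __init__(self, rows):
        self.rows = rows

    def open(self):               # ~ ExecInitNode
        self.pos = 0

    def next(self):               # ~ ExecProcNode
        if self.pos >= len(self.rows):
            return None           # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):              # ~ ExecEndNode
        pass

def run_query(root):
    results = []
    root.open()                   # ~ ExecutorStart: initialize the tree
    try:
        # ~ ExecutorRun: pull from the root until it signals exhaustion
        while (tup := root.next()) is not None:
            results.append(tup)
    finally:
        root.close()              # ~ ExecutorEnd: release resources
    return results

print(run_query(Scan([1, 2, 3])))  # [1, 2, 3]
```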
For a more comprehensive overview, see the executor README.
The function that ties this all together is ExecProcNode. It’s how any node asks its child for the next tuple, the equivalent of calling next():
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL)
        ExecReScan(node);

    // Call the node-specific processing function via function pointer
    return node->ExecProcNode(node);
}

Each node type assigns its own function pointer to the ExecProcNode field. For example, calling it on a sequential scan node invokes ExecSeqScan; on a sort node, ExecSort; and so on. The functions handling these node executions live in the src/backend/executor directory.
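The dispatch mechanism can be mimicked in a few lines of Python (an illustrative analogy; the functions and fields here are invented for the example, not Postgres code):

```python
# Each "node" carries its own processing function in a common field,
# mirroring how PlanState stores a per-node-type function pointer.
def exec_seq_scan(node):
    return f"tuple from seq scan on {node['relation']}"

def exec_sort(node):
    return "tuple from sort"

def exec_proc_node(node):
    # Same call site for every node type; the behavior comes entirely
    # from the function stored on the node.
    return node["exec_proc"](node)

scan_node = {"exec_proc": exec_seq_scan, "relation": "demo"}
sort_node = {"exec_proc": exec_sort}

print(exec_proc_node(scan_node))  # tuple from seq scan on demo
print(exec_proc_node(sort_node))  # tuple from sort
```

This indirection is why the executor can drive any plan tree with one uniform call, while each node type keeps its own implementation.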
Tracing execution with GDB
Installing Postgres from source
To trace execution, it’s best to compile Postgres from source with debug symbols enabled and optimizations disabled. The official documentation provides detailed instructions. Here, we’ll provide a brief outline of the steps.
Steps
- Update package lists and install necessary build dependencies:
sudo apt-get update && sudo apt-get -y install \
  build-essential \
  libreadline-dev \
  zlib1g-dev \
  flex \
  bison \
  libxml2-dev \
  libxslt-dev \
  libssl-dev \
  wget \
  gdb

- Download and extract the Postgres source code (replace version number if needed):

wget https://ftp.postgresql.org/pub/source/v18.0/postgresql-18.0.tar.gz
tar -xf postgresql-18.0.tar.gz
cd postgresql-18.0

- Configure the build, compile, and install. The --enable-debug option adds debug symbols, and CFLAGS='-O0' disables compiler optimizations. For more details on the configure step: https://www.postgresql.org/docs/current/install-make.html#CONFIGURE-OPTIONS

./configure --enable-debug --without-icu CFLAGS='-O0'
make -j$(nproc)
sudo make install

- Set up the data directory:

sudo mkdir -p /usr/local/pgsql/data

- Create a new user, initialize the data directory, and start the Postgres cluster:

sudo useradd --system --create-home postgres
sudo chown postgres:postgres /usr/local/pgsql/data
sudo -iu postgres

# now as postgres:
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start

Test the installation by connecting with psql:

/usr/local/pgsql/bin/psql -U postgres
# You should get the psql prompt: postgres=#
# Type \q to exit

We can use GDB to trace function calls during query execution and see the control flow. The following script sets breakpoints on selected executor functions and logs each call with indentation to show the call stack depth:
python
import gdb
class LogFunctionBreakPoint(gdb.Breakpoint):
    def __init__(self, func_name):
        super().__init__(func_name)
        self.func_name = func_name
        self.silent = True

    def stop(self):
        frame = gdb.selected_frame()
        depth = 0
        while frame:
            frame = frame.older()
            if frame:
                depth += 1

        indent = ' ' * depth
        print(f"{indent} -> {self.func_name}")

        return False

functions_to_trace = ['ExecutorRun', 'ExecProcNode', 'MultiExecProcNode',
                      'ExecSort', 'ExecHashJoin', 'ExecSeqScan',
                      'MultiExecHash', 'ExecLimit']

for fn in functions_to_trace:
    LogFunctionBreakPoint(fn)

Now, let's connect to Postgres via psql and create some demo data:
CREATE TABLE demo (a int);
INSERT INTO demo
SELECT s FROM generate_series(1, 3) s;

Find the backend PID and attach GDB:

SELECT pg_backend_pid();

sudo gdb -p 65040 -x script.gdb

Tracing a simple query
Let’s execute a simple LIMIT query:
SELECT *
FROM demo
LIMIT 2;

The trace looks like this:
-> ExecutorRun
  -> ExecProcNode
    -> ExecLimit
      -> ExecProcNode
        -> ExecSeqScan
  -> ExecProcNode
    -> ExecLimit
      -> ExecProcNode
        -> ExecSeqScan
  -> ExecProcNode
    -> ExecLimit

You can see the pull-based flow in action. ExecutorRun asks the root (Limit) for a tuple. Limit asks its child (SeqScan). SeqScan returns a row. That's one cycle.
The pattern repeats for the second tuple. On the third call, ExecLimit knows it’s already returned 2 rows, so it signals completion without asking SeqScan for anything.
Tracing a more complex query
Now let’s try a join with sorting:
-- Force a hash join
SET enable_nestloop = off;
SET enable_mergejoin = off;

SELECT d1.a
FROM demo d1
JOIN demo d2 ON d1.a = d2.a
ORDER BY d1.a DESC;

The trace is more interesting:
-> ExecutorRun
  -> ExecProcNode
    -> ExecSort
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
          -> MultiExecProcNode
            -> MultiExecHash
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
              -> ExecProcNode
                -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
      -> ExecProcNode
        -> ExecHashJoin
          -> ExecProcNode
            -> ExecSeqScan
  -> ExecProcNode
    -> ExecSort
  -> ExecProcNode
    -> ExecSort
  -> ExecProcNode
    -> ExecSort

This is where the pipeline breaker shows up clearly. When ExecutorRun first asks Sort for a tuple, Sort can't return one yet because it needs all of its input before it can determine the final ordering. So, it initiates a cascade of pulls from the HashJoin tree below it.
The HashJoin node pulls from its child sequential scans, builds its hash table, and finds matches. The Sort node must pull from this join until all matched rows have been consumed. Only once the HashJoin is exhausted can Sort begin returning results to the top level. From that point on, each call to ExecSort returns the next row from the already-sorted set.
The trade-off
The cost of this design is per-tuple overhead. Look at the trace for the join query: three rows in each table, and the executor made dozens of function calls to produce the result. Each tuple passes through ExecProcNode, into the node-specific function, and back.
The upside is composability. Every node follows the same pull-based protocol, so adding a new node type is a matter of implementing open(), next(), and close() and letting it slot into any plan tree. If you're curious how the per-tuple overhead can be mitigated, take a look at vectorized execution.
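To illustrate the idea (a toy Python sketch, not a real engine's API), a vectorized scan returns a batch of tuples per call instead of one, amortizing the per-call overhead across many rows:

```python
# Batch-at-a-time scan: one call yields up to batch_size tuples,
# so the interpretation overhead per call is spread over a whole batch.
class BatchScan:
    def __init__(self, rows, batch_size=1024):
        self.rows, self.batch_size = rows, batch_size
        self.pos = 0

    def next_batch(self):
        # None signals end of input, mirroring next() in the tuple model.
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch

scan = BatchScan([1, 2, 3, 4, 5], batch_size=2)
print(scan.next_batch())  # [1, 2]
print(scan.next_batch())  # [3, 4]
print(scan.next_batch())  # [5]
print(scan.next_batch())  # None
```

With three rows per table, the difference is invisible; with millions of rows, turning per-tuple calls into per-batch calls is what makes this approach attractive.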