Inside PostgreSQL Query Execution: The Volcano Model
Imagine a production line where nothing moves until the end of the line asks for it. The final station says “I need the next item,” and that request ripples backward. Each station asks the one before it, all the way back to the raw materials. Parts only get pulled through the line when someone downstream actually needs them.
That’s more or less how PostgreSQL executes your queries. By the time a query reaches the executor, it’s already been through several stages: the SQL text has been parsed into a tree, views have been expanded, and the planner has figured out the most efficient way to actually get the data. The planner outputs a tree of plan nodes, each representing a specific operation like scanning, filtering, sorting, or joining. Other database systems sometimes call these operators. Think of each node as a station on our production line.
The Volcano model
The executor runs this tree using the Volcano processing model, a demand-driven approach used by PostgreSQL (and many other databases). Execution starts at the root of the plan tree (the final station), which requests a tuple from its child. This request cascades down to a leaf node (typically a table or index scan, which fetches the actual data). The tuple then flows back up the tree, each node processing it along the way, until it reaches the root and gets returned to the client. Because this happens one tuple at a time, it’s known as tuple-at-a-time processing.
This works because every node follows a standard three-method protocol: open() to set up the node (allocate memory, initialize state, and recursively open any child nodes it depends on), next() to produce and return the next tuple when asked (calling next() on children if it needs input), and close() to clean up resources when processing is complete.
This interface makes nodes composable. Any node can consume tuples from any other node as long as they adhere to the open/next/close interface. This allows complex plans to be built from simple, reusable blocks.
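To make the protocol concrete, here is a minimal Python sketch (illustrative only, not PostgreSQL code; the class and variable names are invented for this example). It shows two nodes following the open/next/close protocol, with a filter node stacked on top of a scan node:

```python
# Toy Volcano-style nodes: each one exposes open(), next(), close().
class SeqScan:
    """Leaf node: produces tuples from an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Filter:
    """Inner node: pulls from any child that follows the same protocol."""
    def __init__(self, predicate, child):
        self.predicate = predicate
        self.child = child

    def open(self):
        self.child.open()  # recursively open children

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row  # pass matching tuples up; None signals end

    def close(self):
        self.child.close()


# Compose a tiny plan: Filter(even) on top of SeqScan.
plan = Filter(lambda r: r % 2 == 0, SeqScan([1, 2, 3, 4, 5]))
plan.open()
out = []
while (t := plan.next()) is not None:
    out.append(t)
plan.close()
print(out)  # [2, 4]
```

Because `Filter` only assumes its child implements the same three methods, any node could be swapped in as its input, which is exactly the composability the interface provides.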
When the line stalls: pipeline breakers
Most of the time, tuples flow smoothly through the tree. A filter node checks a row and passes it along; a nested loop join emits a combined row the exact moment it finds a match. This is called pipelining.
But some operations can’t produce any output until they’ve consumed all of their input. These are pipeline breakers, and they’re the equivalent of a station that needs to receive every item before it can start sorting or assembling them.
Common examples include sorting (you must see all rows to know which comes first) and the build phase of a hash join (the complete hash table must be built from one input before probing it with the other).
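The sorting case can be sketched in the same toy Python protocol as before (illustrative only, not PostgreSQL code; `ListScan` and `Sort` are invented names). The key point is that the first `next()` call on the sort node drains its entire child before returning a single tuple:

```python
class ListScan:
    """Minimal leaf node producing tuples from a Python list."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Sort:
    """Pipeline breaker: the first next() call drains the whole child."""
    def __init__(self, child):
        self.child = child

    def open(self):
        self.child.open()
        self.sorted_rows = None

    def next(self):
        if self.sorted_rows is None:
            # Consume the ENTIRE input before emitting a single tuple.
            rows = []
            while (t := self.child.next()) is not None:
                rows.append(t)
            self.sorted_rows = sorted(rows)
            self.pos = 0
        if self.pos >= len(self.sorted_rows):
            return None
        row = self.sorted_rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.child.close()


plan = Sort(ListScan([3, 1, 2]))
plan.open()
first = plan.next()   # this single call consumed all three input rows
rest = [plan.next(), plan.next(), plan.next()]
plan.close()
print(first, rest)  # 1 [2, 3, None]
```

All the work happens inside the first `next()` call; subsequent calls just hand out rows from the already-sorted buffer, which is exactly the latency pattern the cursor example below demonstrates.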
This video, part of a lecture series by Universität Tübingen, is a good resource on the Volcano model and demonstrates the impact of pipeline breakers.
In short, if you create a cursor for a query that includes a pipeline breaker (like a sort node), you will notice a delay when fetching the first record that depends on it, as the operation first needs to consume all of its input data. For example:
```sql
begin;

declare demo cursor for
  -- These two rows are available immediately
  select i from generate_series(1, 2) as i
  union all
  -- This subquery needs sorting, acting as a pipeline breaker
  (select i from generate_series(5000000, 3, -1) as i order by i limit 2);

-- Fetching the first two tuples (from the first part of UNION ALL) is fast
fetch next demo; -- 4.558 ms
fetch next demo; -- 7.095 ms

-- Fetching the third tuple requires the sort to complete processing
-- its entire input (5 million rows) before emitting its smallest row.
-- This will take significantly longer.
fetch next demo; -- 697.278 ms

-- Fetching the fourth tuple (the second from the sorted subquery) is fast again,
-- as the sort has already completed.
fetch next demo; -- 8.205 ms

rollback;
```

Enable timing in psql with `\timing` to see the execution durations.
Volcano execution model in PostgreSQL
Now, let’s take a look at how the Volcano model is implemented in Postgres. The execution of a query in Postgres follows this pattern:
- `ExecutorStart`: This function initializes the query execution. It calls `ExecInitNode` recursively on the plan tree, starting from the root. `ExecInitNode` performs the setup for each node (analogous to `open()`).
- `ExecutorRun`: This function drives the execution. It enters a loop that repeatedly calls `ExecProcNode` on the top-level plan node to fetch result tuples (analogous to calling `next()` on the root). It continues until the top node signals that no more tuples are available.
- `ExecutorEnd`: This function cleans up after execution is complete. It calls `ExecEndNode` recursively on the tree to release resources, corresponding to the `close()` operation.
For a more comprehensive overview, see the executor README.
The key function that ties it all together is ExecProcNode. It’s the mechanism by which any node asks its child for the next tuple — the practical equivalent of calling next():
```c
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL)
        ExecReScan(node);

    /* Call the node-specific processing function via function pointer */
    return node->ExecProcNode(node);
}
```

Each node type assigns its own function pointer to that `ExecProcNode` field, so calling it on a SeqScan node invokes `ExecSeqScan`, calling it on a Sort node invokes `ExecSort`, and so on. The functions handling these node executions reside in the `src/backend/executor` directory.
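The dispatch idea can be illustrated with a hypothetical Python analogy (not actual PostgreSQL code; the function and dictionary names here are invented): each node registers its own processing function once at initialization, so the generic entry point contains no switch over node types.

```python
# Per-node processing functions, analogous to ExecSeqScan, ExecSort, etc.
def exec_seq_scan(node):
    return "ExecSeqScan: next tuple from " + node["relation"]

def exec_sort(node):
    return "ExecSort: next tuple in sorted order"

def exec_proc_node(node):
    # Generic dispatcher: call whatever the node registered at init time.
    return node["exec_proc"](node)

# Each "node" stores its own callable, like the ExecProcNode field pointer.
seq_scan_node = {"relation": "demo", "exec_proc": exec_seq_scan}
sort_node = {"exec_proc": exec_sort}

print(exec_proc_node(seq_scan_node))  # ExecSeqScan: next tuple from demo
print(exec_proc_node(sort_node))      # ExecSort: next tuple in sorted order
```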
Let’s make this concrete. Given:
```sql
select * from demo limit 2;
```

`ExecutorRun` calls `ExecProcNode` on the root node (Limit). That invokes `ExecLimit`, which calls `ExecProcNode` on its child (SeqScan), which invokes `ExecSeqScan` to fetch a tuple. The tuple flows back up. Repeat until the limit is satisfied.
Tracing execution with GDB
Installing Postgres from source
To trace execution, it’s best to compile PostgreSQL from source with debug symbols enabled and optimizations disabled. The official documentation provides detailed instructions. Here, we’ll provide a brief outline of the steps. For an automated approach using Terraform to deploy and install Postgres on a VM in Google Cloud, you can refer to this Gist.
Steps
- Update package lists and install necessary build dependencies (run as root or use sudo):

```shell
apt-get update && apt-get -y install \
  build-essential \
  libreadline-dev \
  zlib1g-dev \
  flex \
  bison \
  libxml2-dev \
  libxslt-dev \
  libssl-dev \
  wget \
  gdb
```

- Download and extract the PostgreSQL source code (replace version number if needed):

```shell
wget https://ftp.postgresql.org/pub/source/v16.3/postgresql-16.3.tar.gz
tar -xf postgresql-16.3.tar.gz
cd postgresql-16.3
```

- Configure the build, compile, and install. The `--enable-debug` option adds debug symbols, and `CFLAGS='-O0'` disables compiler optimizations. For more details on the configure step: https://www.postgresql.org/docs/current/install-make.html#CONFIGURE-OPTIONS

```shell
./configure --enable-debug --without-icu CFLAGS='-O0'
make -j$(nproc)
make install
```

- Set up the data directory:

```shell
mkdir -p /usr/local/pgsql/data
```

- Create a new user, initialize the data directory, and start the PostgreSQL cluster:

```shell
useradd --system postgres
chown -R postgres:postgres /usr/local/pgsql/data
sudo -u postgres /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start
```

Test the installation by connecting with psql:

```shell
/usr/local/pgsql/bin/psql -U postgres
# You should get the psql prompt: postgres=#
# Type \q to exit
```

To better understand how the Volcano model works in practice, we can use GNU Debugger (GDB) to trace function calls during query execution. This approach allows us to see the actual control flow as a query is processed.
The following script sets breakpoints on key executor functions and logs each call with indentation to show the call stack depth:
```python
python

import gdb

class LogFunctionBreakPoint(gdb.Breakpoint):
    def __init__(self, func_name):
        super().__init__(func_name)
        self.func_name = func_name
        self.silent = True

    def stop(self):
        frame = gdb.selected_frame()
        depth = 0
        while frame:
            frame = frame.older()
            if frame:
                depth += 1

        indent = ' ' * depth
        print(f"{indent} -> {self.func_name}")

        return False

functions_to_trace = ['ExecutorRun', 'ExecProcNode', 'MultiExecProcNode',
                      'ExecSort', 'ExecHashJoin', 'ExecSeqScan',
                      'MultiExecHash', 'ExecLimit']

for fn in functions_to_trace:
    LogFunctionBreakPoint(fn)
end
```

Now, let’s connect to Postgres via psql and create some demo data:
```sql
create table demo (a int);
insert into demo select s from generate_series(1, 3) s;
```

Find the backend PID and attach GDB:

```sql
select pg_backend_pid();
```

```shell
sudo gdb -p 65040 -x script.gdb
```

Tracing a simple query
Let’s execute our simple LIMIT query again:
```sql
select * from demo limit 2;
```

Here’s what the trace looks like:

```
 -> ExecutorRun
  -> ExecProcNode
   -> ExecLimit
    -> ExecProcNode
     -> ExecSeqScan
  -> ExecProcNode
   -> ExecLimit
    -> ExecProcNode
     -> ExecSeqScan
  -> ExecProcNode
   -> ExecLimit
```

You can see the pull-based flow in action. `ExecutorRun` asks the root (Limit) for a tuple. Limit asks its child (SeqScan). SeqScan delivers a row. That’s one cycle.
The pattern repeats for the second tuple. On the third call, ExecLimit knows it’s already returned 2 rows, so it signals completion without asking SeqScan for anything.
Tracing a more complex query
Now let’s try a join with sorting:
```sql
-- Force specific join types for demonstration
SET enable_nestloop = off;
SET enable_mergejoin = off;
SET enable_bitmapscan = off;

select d1.a
from demo d1
join demo d2 on d1.a = d2.a
order by d1.a desc;
```

The trace gets more interesting:

```
 -> ExecutorRun
  -> ExecProcNode
   -> ExecSort
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
      -> MultiExecProcNode
       -> MultiExecHash
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
        -> ExecProcNode
         -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
    -> ExecProcNode
     -> ExecHashJoin
      -> ExecProcNode
       -> ExecSeqScan
  -> ExecProcNode
   -> ExecSort
  -> ExecProcNode
   -> ExecSort
  -> ExecProcNode
   -> ExecSort
```

This is where the pipeline breaker shows up clearly. When `ExecutorRun` first asks Sort for a tuple, Sort can’t answer yet — it needs all the data first. So it pulls repeatedly from HashJoin, which in turn pulls from its children (building the hash table from one SeqScan, then probing it with rows from the other). Only after HashJoin is exhausted does Sort have everything it needs. From that point on, the subsequent calls to `ExecSort` return immediately — the sorted results are already in memory, ready to be handed out one at a time.
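The build/probe asymmetry visible in the trace can be sketched in a few lines of plain Python (illustrative only, not PostgreSQL internals; `hash_join` and `key` are invented names for this example). The inner input must be fully consumed into a hash table before any probe can produce output, while the probe phase itself streams one row at a time:

```python
def hash_join(outer, inner, key):
    """Toy hash join: blocking build phase, then streaming probe phase."""
    # Build phase (pipeline breaker): drain the ENTIRE inner input first.
    table = {}
    for row in inner:
        table.setdefault(key(row), []).append(row)
    # Probe phase (pipelined): each outer row can be matched immediately.
    for row in outer:
        for match in table.get(key(row), []):
            yield (row, match)

pairs = list(hash_join([1, 2, 3], [2, 3, 4], key=lambda x: x))
print(pairs)  # [(2, 2), (3, 3)]
```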
Wrapping up
The Volcano model gives PostgreSQL a clean, composable way to execute queries: every node follows the same pull-based protocol, and complex plans are just trees of these interchangeable building blocks. That said, the tuple-at-a-time approach does come with overhead — all those function calls add up. This has led analytical databases to explore alternatives like vectorized execution, which processes data in batches rather than individual tuples.
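As a rough illustration of that batching idea (toy code, not taken from any real engine; `BatchScan` and `next_batch` are invented names), a vectorized node returns a chunk of rows per call, amortizing the per-call overhead that tuple-at-a-time execution pays for every single row:

```python
class BatchScan:
    """Toy vectorized scan: next_batch() returns many rows per call."""
    def __init__(self, rows, batch_size=1024):
        self.rows = rows
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        if self.pos >= len(self.rows):
            return None  # input exhausted
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch

# Ten rows flow through in three calls instead of ten.
scan = BatchScan(list(range(10)), batch_size=4)
batches = []
while (b := scan.next_batch()) is not None:
    batches.append(b)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```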