Learn by reading through in order

Join Algorithms and Join Order

On perf_sales joined with employee, you'll read the outer / inner loops of a nested loop join from the plan, add an index to flip the inner side from SCAN to SEARCH, and walk through illustrated views of hash join and sort-merge join used by other databases.

The dataset for this article — perf_sales and employee

When you join two tables, the database picks a join algorithm internally — the procedure it uses to match rows from the two tables.

Different join methods come with different costs.

The dataset is the sales table perf_sales (50,000 rows; emp_id points to a sales rep) and the employee table employee (30 rows; emp_id is effectively the primary key).

When you join the two on emp_id, you'll watch which one shows up as the outer loop in the plan and how the join order shifts depending on whether an index exists.

Before getting into the exercises, check the column definitions and a sample of the data for the two tables this article uses — perf_sales and employee. Generating perf_sales takes a moment, so the first run may pause for a few seconds.

① Use PRAGMA table_info(perf_sales); and PRAGMA table_info(employee); to view the column definitions of both tables.

② Use SELECT * FROM perf_sales LIMIT 5; and SELECT * FROM employee LIMIT 5; to preview the first 5 rows of each.

③ Use SELECT COUNT(*) FROM perf_sales; and SELECT COUNT(*) FROM employee; to confirm the row count gap between them (50,000 vs. 30).

SQL Editor

Run a query to see results

Nested loop join and join order

A nested loop join walks the outer table one row at a time and, for each row, looks up rows in the inner table that satisfy the join condition — a doubly nested loop.

-- Compare the same join query across three join methods
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;

-- Nested loop join internal flow
-- for each e in employee:        -- outer loop (30 rows)
--   for each s in perf_sales:    -- inner (without an index, scan all 50,000 rows)
--     if s.emp_id = e.emp_id then add to the join result
How a nested loop join works
Outer loopemployee (30 rows)InnerLook up perf_sales by emp_idJoin resultRow 1emp_id=1 (Alice)Find rows with emp_id=1-> about 1,667 hitsEmit Alice x 1,667rowsRow 2emp_id=2 (Bob)Find rows with emp_id=2-> about 1,667 hitsEmit Bob x 1,667rows...repeat through row 30...through emp_id=30Emit 50,000 rows total
Walk the outer table one row at a time, look up the inner table by the join key, and emit each matching combination. For every employee row, look up perf_sales by emp_id.
  • The outer loop runs as many times as the table has rows, so put the smaller table on the outside
  • The inner side gets looked up repeatedly from each outer row, so index the join key to make it a single-shot lookup
  • In the plan, the table that appears first is the outer, and the one that appears next is the inner
  • The optimizer picks outer / inner automatically from statistics, so the order you write in FROM isn't necessarily honored
Nested loop join and join order
Good join orderBad join orderOuter: employee30 rows -> 30 loopsOuter: perf_sales50,000 rows -> 50k loopsInner: look up perf_salesby the emp_id indexInner: full-scan employeeevery timeFew outer loops,inner is one-shot via indexMany outer loops,total access balloons
Walk the outer table one row at a time, then look up the inner. Putting the smaller employee on the outside and using the emp_id index for the inner perf_sales keeps the work light.
-- Join perf_sales and employee on emp_id (total sales per employee in dept 1)
-- Without an index on perf_sales.emp_id, the plan tends to scan the inner perf_sales in full
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SCAN perf_sales AS s   (no index, so the inner is also fully scanned)

-- Add an index on perf_sales.emp_id and the inner can be looked up via the index
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SEARCH perf_sales AS s USING INDEX ix_sales_emp (emp_id=?)

Look at the plan for a join query that counts sales per sales rep, and read which table is on the outer loop. (Run it correctly and the explanation will appear.)

① Prefix EXPLAIN QUERY PLAN to a query that joins employee and perf_sales on emp_id and counts rows per employee name with GROUP BY, and check its plan.

② The table that appears first in the output is the outer loop, and the one that appears next is the inner. With no index on perf_sales.emp_id yet, read which one ends up outside and how the inner is being read.

SQL Editor

Run a query to see results

Make the inner table of the join lookable via an index and confirm that the plan switches from SCAN to SEARCH ... USING INDEX. Wrap up index creation and the plan check in a single execution.

① Drop any same-named index with DROP INDEX IF EXISTS, then create an index on perf_sales's emp_id column with CREATE INDEX.

② Prefix EXPLAIN QUERY PLAN to a query that joins employee and perf_sales on emp_id and counts rows per employee name, and check its plan. Confirm that employee (30 rows) stays on the outer loop while the inner perf_sales is now looked up via the index.

SQL Editor

Run a query to see results

The optimizer picks the join order automatically from statistics

Join order — which table goes on the outer loop — is, as a rule, picked automatically by the optimizer based on statistics.

If you've gathered statistics with ANALYZE (covered in the previous article), the optimizer can more accurately estimate "which table is smaller" and "whether the inner can be looked up via an index" and optimize the outer / inner assignment.

The order you write in FROM may well be reshuffled by the optimizer.

The trick for reading join order: in EXPLAIN QUERY PLAN output, the table higher up is closer to the outer loop.

When the wrong table ends up on the outer loop and things slow down, the first thing to check is whether the join key on the table you want as the inner has an index.

-- After ANALYZE, outer / inner are decided more accurately from statistics
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
ANALYZE;

-- Even with perf_sales first in FROM, the optimizer may put the smaller employee on the outside
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM perf_sales s
JOIN employee e ON e.emp_id = s.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SEARCH perf_sales AS s USING INDEX ix_sales_emp (emp_id=?)

Confirm that the optimizer picks the same join order even when you reverse the table order in FROM. Wrap up index creation, statistics collection, and the plan check in a single execution.

① Drop any same-named index with DROP INDEX IF EXISTS, create an index on perf_sales's emp_id, then run ANALYZE; to gather statistics.

② Prefix EXPLAIN QUERY PLAN to a query whose FROM is now FROM perf_sales s JOIN employee e ON ... (the larger table written first in FROM) and counts rows per employee name, and check its plan. Read whether the optimizer still picks the 30-row employee for the outer loop even with the FROM order flipped.

SQL Editor

Run a query to see results

Hash join and sort-merge join — join methods used by other databases

Large databases (PostgreSQL / Oracle / SQL Server and so on) also use hash join (build a hash table from the smaller side and probe the other against it) and sort-merge join (sort both sides by the join key and walk them in lockstep) alongside the nested loop join.

The engine that powers this course's browser console only uses nested loop, so this section covers the two other methods at the diagram-and-concept level.

Hash join — turn the smaller side into a hash table and stream the other through it

-- Same query, conceptually running as a hash join
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;

-- Hash join internal flow
-- H := { e.emp_id => e | e in employee }    -- (1) Build (turn employee into a hash table)
-- for each s in perf_sales:                  -- (2) Probe (stream perf_sales through)
--   if H[s.emp_id] exists: add to the join result
How a hash join works
Build phaseSmaller side into a hash tableProbe phaseStream the larger side row by rowJoin resultRead employee(30 rows)Pull perf_sales (50,000 rows)one row at a timeAlice x 1,667 rowsBob x 1,667 rows...Hash tableemp_id -> employee infoLook up the hash tablein one shot by emp_id50,000 rows totalLook up the table
(1) In the build phase, build a hash table keyed by the smaller side's emp_id (employee). (2) In the probe phase, stream the larger side (perf_sales) one row at a time and look it up in the hash table.
  • Putting the smaller table on the build side keeps the hash table small enough to fit in memory, which makes things fast
  • Only works for equality joins (=); range conditions (< / BETWEEN) can't be joined this way
  • Can join large row counts on both sides with a single pass each even without an inner index, so it's often faster than a nested loop
  • Shows up as something like Hash Join in production databases' EXPLAIN

Sort-merge join — sort both sides, then walk them together

-- Same query, conceptually running as a sort-merge join
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;

-- Sort-merge join internal flow
-- E := sort(employee by emp_id)        -- (1) Sort (order both by emp_id)
-- S := sort(perf_sales by emp_id)
-- i, j := 0, 0
-- while i < |E| and j < |S|:            -- (2) Merge
--   if E[i].emp_id = S[j].emp_id: add to the join result, j++
--   else if E[i].emp_id < S[j].emp_id: i++
--   else: j++
How a sort-merge join works
Sort phaseOrder both by emp_idMerge phaseWalk both sides togetherJoin resultemployee in emp_id order(1, 2, 3, ..., 30)Advance pointersfrom the start of each sideResult, sortedby emp_idperf_sales in emp_id order(1, 1, ..., 30, 30)On emp_id match,emit the combination50,000 rows total
(1) In the sort phase, order both tables by the join key (emp_id). (2) In the merge phase, advance pointers from the start of each side, and emit each combination where emp_id matches.
  • If both sides are already sorted (lined up by an index, for example), the sort cost is essentially zero and it's extremely fast
  • Handles range joins (join conditions including < or BETWEEN) too — a strength hash join doesn't have
  • The result comes out sorted by the join key, which is convenient for downstream aggregations and merges
  • Shows up as something like Merge Join in production databases' EXPLAIN

Hash join and sort-merge join are join methods of other RDBMSs

Hash join and sort-merge join are join methods provided by large databases like PostgreSQL / Oracle / SQL Server.

The engine that powers this course's browser console only uses nested loop as its join algorithm (it may create a transient internal index to help the join if needed), so hash join and sort-merge join can't be demonstrated live in this console.

For those we settle for the diagrams and read-only code above to capture the idea.

Meanwhile, the nested loop join's outer / inner, join order, and the index-driven flip from SCAN to SEARCH on the inner side, all covered earlier in this article, are observable for real.

The ability to read which join method the database picked rests on the same nested loop and join order understanding.

Production databases will show terms like Hash Join / Merge Join in their EXPLAIN, so start by getting comfortable reading nested loop plans in this course.

-- Below is what EXPLAIN might look like in PostgreSQL (read-only; not executed in this course's console)
-- EXPLAIN SELECT e.name, SUM(s.amount) FROM employee e
--   JOIN perf_sales s ON s.emp_id = e.emp_id GROUP BY e.name;
-- Example plans:
--   Hash Join  (build a hash table from employee, then stream perf_sales)
--   or Merge Join (sort both by emp_id, then match them up)
-- PostgreSQL picks among these methods automatically based on statistics and cost

-- What this course's console can actually demonstrate is this (nested loop):
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee / SEARCH perf_sales USING INDEX ix_sales_emp (emp_id=?)

As a wrap-up for this article, check the join plan one more time with the index in place and read the outer loop, the inner index access, and the meaning of the result. Wrap up index creation, the plan check, and the result check in a single execution.

① Drop any same-named index with DROP INDEX IF EXISTS, then create an index on perf_sales's emp_id.

② Prefix EXPLAIN QUERY PLAN to a query that joins employee and perf_sales on emp_id and counts sales per rep name, and check its plan.

③ Run the same aggregation query (without EXPLAIN QUERY PLAN) with ORDER BY cnt DESC LIMIT 5; to pull the top 5 reps by count. Read both the plan and the actual result.

SQL Editor

Run a query to see results
QUIZ

Knowledge Check

Answer each question one by one.

Q1Which of the following correctly describes a nested loop join?

Q2Which of the following describes the basic idea behind picking a join order for a nested loop join?

Q3Which of the following correctly describes hash join and sort-merge join?