Q1Which of the following correctly describes a nested loop join?
Join Algorithms and Join Order
On perf_sales joined with employee, you'll read the outer / inner loops of a nested loop join from the plan, add an index to flip the inner side from SCAN to SEARCH, and walk through illustrated views of hash join and sort-merge join used by other databases.
The dataset for this article — perf_sales and employee
When you join two tables, the database picks a join algorithm internally — the procedure it uses to match rows from the two tables.
Different join methods come with different costs.
The dataset is the sales table perf_sales (50,000 rows; emp_id points to a sales rep) and the employee table employee (30 rows; emp_id is effectively the primary key).
When you join the two on emp_id, you'll watch which one shows up as the outer loop in the plan and how the join order shifts depending on whether an index exists.
Nested loop join and join order
A nested loop join walks the outer table one row at a time and, for each row, looks up rows in the inner table that satisfy the join condition — a doubly nested loop.
-- Compare the same join query across three join methods
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;
-- Nested loop join internal flow
-- for each e in employee: -- outer loop (30 rows)
-- for each s in perf_sales: -- inner (without an index, scan all 50,000 rows)
-- if s.emp_id = e.emp_id then add to the join result
- The outer loop runs as many times as the table has rows, so put the smaller table on the outside
- The inner side gets looked up repeatedly from each outer row, so index the join key to make it a single-shot lookup
- In the plan, the table that appears first is the outer, and the one that appears next is the inner
- The optimizer picks outer / inner automatically from statistics, so the order you write in
FROMisn't necessarily honored
-- Join perf_sales and employee on emp_id (total sales per employee in dept 1)
-- Without an index on perf_sales.emp_id, the plan tends to scan the inner perf_sales in full
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SCAN perf_sales AS s (no index, so the inner is also fully scanned)
-- Add an index on perf_sales.emp_id and the inner can be looked up via the index
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SEARCH perf_sales AS s USING INDEX ix_sales_emp (emp_id=?)
The optimizer picks the join order automatically from statistics
Join order — which table goes on the outer loop — is, as a rule, picked automatically by the optimizer based on statistics.
If you've gathered statistics with ANALYZE (covered in the previous article), the optimizer can more accurately estimate "which table is smaller" and "whether the inner can be looked up via an index" and optimize the outer / inner assignment.
The order you write in FROM may well be reshuffled by the optimizer.
The trick for reading join order: in EXPLAIN QUERY PLAN output, the table higher up is closer to the outer loop.
When the wrong table ends up on the outer loop and things slow down, the first thing to check is whether the join key on the table you want as the inner has an index.
-- After ANALYZE, outer / inner are decided more accurately from statistics
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
ANALYZE;
-- Even with perf_sales first in FROM, the optimizer may put the smaller employee on the outside
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM perf_sales s
JOIN employee e ON e.emp_id = s.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee AS e
--> SEARCH perf_sales AS s USING INDEX ix_sales_emp (emp_id=?)
Hash join and sort-merge join — join methods used by other databases
Large databases (PostgreSQL / Oracle / SQL Server and so on) also use hash join (build a hash table from the smaller side and probe the other against it) and sort-merge join (sort both sides by the join key and walk them in lockstep) alongside the nested loop join.
The engine that powers this course's browser console only uses nested loop, so this section covers the two other methods at the diagram-and-concept level.
Hash join — turn the smaller side into a hash table and stream the other through it
-- Same query, conceptually running as a hash join
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;
-- Hash join internal flow
-- H := { e.emp_id => e | e in employee } -- (1) Build (turn employee into a hash table)
-- for each s in perf_sales: -- (2) Probe (stream perf_sales through)
-- if H[s.emp_id] exists: add to the join result
- Putting the smaller table on the build side keeps the hash table small enough to fit in memory, which makes things fast
- Only works for equality joins (
=); range conditions (</BETWEEN) can't be joined this way - Can join large row counts on both sides with a single pass each even without an inner index, so it's often faster than a nested loop
- Shows up as something like
Hash Joinin production databases'EXPLAIN
Sort-merge join — sort both sides, then walk them together
-- Same query, conceptually running as a sort-merge join
SELECT e.name, s.amount
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id;
-- Sort-merge join internal flow
-- E := sort(employee by emp_id) -- (1) Sort (order both by emp_id)
-- S := sort(perf_sales by emp_id)
-- i, j := 0, 0
-- while i < |E| and j < |S|: -- (2) Merge
-- if E[i].emp_id = S[j].emp_id: add to the join result, j++
-- else if E[i].emp_id < S[j].emp_id: i++
-- else: j++
- If both sides are already sorted (lined up by an index, for example), the sort cost is essentially zero and it's extremely fast
- Handles range joins (join conditions including
<orBETWEEN) too — a strength hash join doesn't have - The result comes out sorted by the join key, which is convenient for downstream aggregations and merges
- Shows up as something like
Merge Joinin production databases'EXPLAIN
Hash join and sort-merge join are join methods of other RDBMSs
Hash join and sort-merge join are join methods provided by large databases like PostgreSQL / Oracle / SQL Server.
The engine that powers this course's browser console only uses nested loop as its join algorithm (it may create a transient internal index to help the join if needed), so hash join and sort-merge join can't be demonstrated live in this console.
For those we settle for the diagrams and read-only code above to capture the idea.
Meanwhile, the nested loop join's outer / inner, join order, and the index-driven flip from SCAN to SEARCH on the inner side, all covered earlier in this article, are observable for real.
The ability to read which join method the database picked rests on the same nested loop and join order understanding.
Production databases will show terms like Hash Join / Merge Join in their EXPLAIN, so start by getting comfortable reading nested loop plans in this course.
-- Below is what EXPLAIN might look like in PostgreSQL (read-only; not executed in this course's console)
-- EXPLAIN SELECT e.name, SUM(s.amount) FROM employee e
-- JOIN perf_sales s ON s.emp_id = e.emp_id GROUP BY e.name;
-- Example plans:
-- Hash Join (build a hash table from employee, then stream perf_sales)
-- or Merge Join (sort both by emp_id, then match them up)
-- PostgreSQL picks among these methods automatically based on statistics and cost
-- What this course's console can actually demonstrate is this (nested loop):
DROP INDEX IF EXISTS ix_sales_emp;
CREATE INDEX ix_sales_emp ON perf_sales(emp_id);
EXPLAIN QUERY PLAN
SELECT e.name, SUM(s.amount) AS total
FROM employee e
JOIN perf_sales s ON s.emp_id = e.emp_id
WHERE e.dept_id = 1
GROUP BY e.name;
--> SCAN employee / SEARCH perf_sales USING INDEX ix_sales_emp (emp_id=?)
Knowledge Check
Answer each question one by one.
Q2Which of the following describes the basic idea behind picking a join order for a nested loop join?
Q3Which of the following correctly describes hash join and sort-merge join?