Learn by reading through in order

Rewriting IN, EXISTS, and Correlated Subqueries

How IN and EXISTS differ in the plan, how to rewrite correlated subqueries as JOIN + aggregation, and how NOT IN silently returns 0 rows with NULL — all verified in the query plan.

The data used in this article — perf_sales and employee

Two queries that return the same result can still produce different query plans (the steps the database picked for which tables to read in what order, and which indexes to use).

In this article you'll learn when to use IN versus EXISTS, how to eliminate a correlated subquery (a subquery re-evaluated for each outer row), and how to rewrite "keep only rows that do (or don't) have a match" — comparing query plans with EXPLAIN QUERY PLAN along the way.

Before jumping into the exercises, let's check the column definitions and sample data for the two tables used in this article — perf_sales and employee.

① Use PRAGMA table_info(perf_sales); and PRAGMA table_info(employee); to confirm the column definitions for both tables.

② Use SELECT * FROM perf_sales LIMIT 5; and SELECT * FROM employee LIMIT 5; to preview the first 5 rows of each. perf_sales has 50,000 rows, so always add LIMIT when previewing.

SQL Editor

Run a query to see results

IN and EXISTS — two ways to write the same check

There are two ways to ask "is there a matching row in another table?".

WHERE col IN (SELECT ...) checks whether the value is in the set the inner query returns, while WHERE EXISTS (SELECT 1 FROM ... WHERE ...) checks whether the inner query returns at least one row.

In many cases both produce the same result, and the database can treat either as "keep only rows that have a match".

Prepending EXPLAIN QUERY PLAN query; shows the steps the database picked, line by line.

SCAN perf_sales means a full-row walk, SEARCH ... USING INDEX means going straight to the needed rows via an index, and CORRELATED SCALAR SUBQUERY is a subquery that gets re-evaluated for each outer row.

IN and EXISTS — converging on the same plan
WHERE emp_id IN (SELECT emp_id FROM employee WHERE dept_id=1)WHERE EXISTS (SELECT 1 FROM employee e WHERE e.emp_id=p.emp_id AND e.dept_id=1)Keep rows whosevalue is in the setKeep rows whereinner returns a rowOften convergeon the same planSame resultverify plan with EXPLAIN
IN asks whether a value is in a set; EXISTS asks whether the inner query returns at least one row. Most databases collapse both into the same "keep only rows that have a match" plan.
-- Example: count sales handled by employees in dept_id=2, two ways

-- IN form
SELECT COUNT(*) FROM perf_sales
WHERE emp_id IN (SELECT emp_id FROM employee WHERE dept_id = 2);

-- EXISTS form (correlated)
SELECT COUNT(*) FROM perf_sales p
WHERE EXISTS (
  SELECT 1 FROM employee e
  WHERE e.emp_id = p.emp_id AND e.dept_id = 2
);

-- Prepend EXPLAIN QUERY PLAN to see the plan
-- EXPLAIN QUERY PLAN SELECT COUNT(*) FROM perf_sales WHERE ...;

Let's get more concrete. How do IN and EXISTS behave depending on the size of the outer table and the size of the inner subquery and whether an index exists?

The two cases below — and which form is faster in each — are summarized in the diagram that follows.

-- Case A: outer perf_sales (50,000 rows), inner dept_id=1 employees (~5 rows, small and fixed)

-- IN form
SELECT * FROM perf_sales
WHERE emp_id IN (SELECT emp_id FROM employee WHERE dept_id = 1);

-- EXISTS form
SELECT * FROM perf_sales p
WHERE EXISTS (
  SELECT 1 FROM employee e
  WHERE e.emp_id = p.emp_id AND e.dept_id = 1
);
Case A — inner is small and fixed (IN tends to win)
Case Aouter perf_sales (50,000)inner dept_id=1 (~5)Evaluate inner once,reuse the value listTends to re-check innerfor each of 50,000 rowsIN tends to winsmall innercaches easilyIN behaviorEXISTS behaviorfaster
When the inner set (dept_id=1 employees) is small and fixed, IN can evaluate the inner once and reuse the value list — usually faster than EXISTS. The green arrow points to the faster side.
-- Case B: outer employee (30 rows), inner perf_sales (50,000 rows, with an index on emp_id)

-- IN form
SELECT * FROM employee e
WHERE e.emp_id IN (SELECT emp_id FROM perf_sales);

-- EXISTS form
SELECT * FROM employee e
WHERE EXISTS (
  SELECT 1 FROM perf_sales p
  WHERE p.emp_id = e.emp_id
);
Case B — small outer with an index on the inner (EXISTS tends to win)
Case Bouter employee (30)inner perf_sales (50k + index)Building a value listof 50,000 is heavy30 index probes, eachstops at the first matchEXISTS tends to winsmall outer + indexearly exit pays offIN behaviorEXISTS behaviorfaster
When the outer (30 employees) is small and the inner (perf_sales) has an index on the join key, EXISTS can probe the index for each outer row and bail out as soon as it finds one (early exit) — usually faster than IN. The green arrow points to the faster side.

Tip — modern databases show almost no real speed difference

We've sorted things into "IN tends to win / EXISTS tends to win", but that's a tendency rooted in older optimizers.

In modern databases like SQLite, PostgreSQL, and MySQL 8.0+, both IN (SELECT ...) and a correlated EXISTS get rewritten to the same "keep only rows that have a match" plan, and the actual plan and real-world speed end up roughly the same.

In other words, you can pick whichever reads better.

When the plan difference or speed difference actually matters, the rule is to measureEXPLAIN QUERY PLAN and wall-clock time.

Write "the count of perf_sales handled by employees with dept_id 2" in two forms — IN and EXISTS — and compare each query plan. (Run it correctly to reveal the explanation.)

① Prepend EXPLAIN QUERY PLAN to show the plan for the count query in emp_id IN (SELECT ...) form.

② Then show the plan for the same count written in EXISTS form (correlating against employee) — again with EXPLAIN QUERY PLAN.

③ Compare the two plans, looking at the words SCAN / SEARCH / USING INDEX and so on.

SQL Editor

Run a query to see results

Rewriting a correlated subquery as JOIN + aggregation

A correlated subquery in the SELECT list (one re-evaluated per outer row) gets called more and more often as the row count grows.

In EXPLAIN QUERY PLAN it shows up as a CORRELATED SCALAR SUBQUERY line.

In many cases it can be rewritten as a single aggregation prepared in advance and joined in — the per-row re-evaluation goes away.

The rewrite pattern turns "for each outer row, call (SELECT ... WHERE child.key = parent.key)" into "build the aggregation table once with GROUP BY key, then JOIN it to the parent".

The result is the same and the correlation disappears from the plan.

-- Before: per-employee sales total via a correlated subquery
SELECT e.emp_id, e.name,
  (SELECT SUM(p.amount) FROM perf_sales p
   WHERE p.emp_id = e.emp_id) AS total
FROM employee e
ORDER BY e.emp_id;

-- After: build the aggregation once and JOIN
WITH sales_sum AS (
  SELECT emp_id, SUM(amount) AS total
  FROM perf_sales
  GROUP BY emp_id
)
SELECT e.emp_id, e.name, s.total
FROM employee e
LEFT JOIN sales_sum s ON s.emp_id = e.emp_id
ORDER BY e.emp_id;
Correlated subquery → one aggregation + JOIN
BeforeAfterFor each employeecall (SELECT SUM(amount) ...)again — correlated formGROUP BY emp_idon perf_sales onceto build the aggregation→ LEFT JOIN to employeeSubquery re-evaluatedper rowOne aggregationcorrelation gone from plan
Calling a subquery for each outer row can be rewritten by aggregating once on the key and then joining. The re-evaluation disappears and the correlation falls out of the plan.

Run "a list of each employee's perf_sales total" two ways — as a correlated subquery and as a rewrite — and look at how the plan changes.

① Prepend EXPLAIN QUERY PLAN to show the plan for the version that puts (SELECT SUM(amount) FROM perf_sales WHERE same emp_id) in the SELECT list as a correlated subquery.

② Roll the same aggregation into a CTE that does GROUP BY on emp_id once, LEFT JOIN it to employee, and show that plan too with EXPLAIN QUERY PLAN.

③ Confirm that the CORRELATED SCALAR SUBQUERY line in the correlated version disappears in the rewritten version.

SQL Editor

Run a query to see results

The NOT IN NULL pitfall — avoid it with NOT EXISTS

When you ask "no matching row" — an anti-join — NOT IN (SELECT ...) has a pitfall.

If even one NULL slips into the inner set, three-valued logic (TRUE / FALSE / UNKNOWN) makes the whole condition UNKNOWN, and the result comes back as 0 rows.

Meanwhile, NOT EXISTS (SELECT 1 ... WHERE ...) only asks whether the inner produces a row, so NULL doesn't break it and the anti-join works correctly.

When writing an anti-join, use NOT EXISTS — or, if you must use NOT IN, add WHERE col IS NOT NULL inside to strip NULLs out.

This course uses NOT EXISTS as the standard form.

Why NOT IN with a NULL returns 0 rows
emp_id NOT IN (3, 7, NULL)Three-valued logicEvaluating for emp_id = 5→ NOT (5=3 OR 5=7 OR 5=NULL)= NOT (FALSE OR FALSE OR UNKNOWN)= NOT (UNKNOWN) = UNKNOWN→ row doesn't surviveAs long as the inner sethas even one NULL, every rowmixes UNKNOWN into its resultResult is always 0 rows
NOT IN is evaluated internally as NOT (col = val1 OR col = val2 ...). A comparison against NULL is UNKNOWN under three-valued logic, and that propagates outward — no rows survive.
NOT IN's NULL trap and the rewrite to NOT EXISTS
NOT IN (SELECT mgr_emp FROM team)-- mgr_emp has NULLNOT EXISTS (SELECT 1 FROM team t WHERE t.mgr_emp = p.emp_id)NULL makes everythingUNKNOWN → 0 rowsAsks only existence→ NULL doesn't matterUse NOT EXISTSfor anti-joinsOr add IS NOT NULLinside NOT IN
If even one NULL is in the inner set, NOT IN collapses to UNKNOWN overall and returns 0 rows. NOT EXISTS only asks whether a row exists, so NULL doesn't affect it and the anti-join works correctly.
-- Construct an example with NULL in the inner set, using team
-- team(mgr_emp) deliberately includes one NULL
WITH team(mgr_emp) AS (
  VALUES (3), (7), (NULL)
)
-- Pitfall: with NULL present, NOT IN returns 0 rows
SELECT COUNT(*) FROM perf_sales
WHERE emp_id NOT IN (SELECT mgr_emp FROM team);

-- Fix: NOT EXISTS isn't affected by NULL, so the anti-join works
WITH team(mgr_emp) AS (
  VALUES (3), (7), (NULL)
)
SELECT COUNT(*) FROM perf_sales p
WHERE NOT EXISTS (
  SELECT 1 FROM team t WHERE t.mgr_emp = p.emp_id
);

Observe — via row counts — what happens when the inner set contains a NULL, comparing NOT IN and NOT EXISTS. This is the last exercise of the article. Build a small value list called team (three values: 3, 7, NULL) as a CTE, then count "perf_sales rows whose emp_id is not in that list" two ways.

① Build a CTE with 3, 7, NULL using WITH team(mgr_emp) AS (VALUES ...), and count rows with emp_id NOT IN (SELECT mgr_emp FROM team).

② Write the same check as NOT EXISTS (SELECT 1 FROM team t WHERE t.mgr_emp = p.emp_id) and count the rows.

③ Compare the two counts and confirm that the NOT IN version is 0.

SQL Editor

Run a query to see results
QUIZ

Knowledge Check

Answer each question one by one.

Q1Which of the following best describes the relationship between WHERE col IN (SELECT ...) and WHERE EXISTS (SELECT 1 ... WHERE ...)?

Q2What's the typical rewrite for a correlated scalar subquery in the SELECT list (one that calls (SELECT SUM(...) WHERE child = parent) per row)?

Q3When the inner set might contain NULL, what's the safe way to write an anti-join (rows that don't match)?