Q1Which of the following best describes the relationship between WHERE col IN (SELECT ...) and WHERE EXISTS (SELECT 1 ... WHERE ...)?
Rewriting IN, EXISTS, and Correlated Subqueries
How IN and EXISTS differ in the plan, how to rewrite correlated subqueries as JOIN + aggregation, and how NOT IN silently returns 0 rows with NULL — all verified in the query plan.
The data used in this article — perf_sales and employee
Two queries that return the same result can still produce different query plans (the steps the database picked for which tables to read in what order, and which indexes to use).
In this article you'll learn when to use IN versus EXISTS, how to eliminate a correlated subquery (a subquery re-evaluated for each outer row), and how to rewrite "keep only rows that do (or don't) have a match" — comparing query plans with EXPLAIN QUERY PLAN along the way.
IN and EXISTS — two ways to write the same check
There are two ways to ask "is there a matching row in another table?".
WHERE col IN (SELECT ...) checks whether the value is in the set the inner query returns, while WHERE EXISTS (SELECT 1 FROM ... WHERE ...) checks whether the inner query returns at least one row.
In many cases both produce the same result, and the database can treat either as "keep only rows that have a match".
Prepending EXPLAIN QUERY PLAN query; shows the steps the database picked, line by line.
SCAN perf_sales means a full-row walk, SEARCH ... USING INDEX means going straight to the needed rows via an index, and CORRELATED SCALAR SUBQUERY is a subquery that gets re-evaluated for each outer row.
-- Example: count sales handled by employees in dept_id=2, two ways
-- IN form
SELECT COUNT(*) FROM perf_sales
WHERE emp_id IN (SELECT emp_id FROM employee WHERE dept_id = 2);
-- EXISTS form (correlated)
SELECT COUNT(*) FROM perf_sales p
WHERE EXISTS (
SELECT 1 FROM employee e
WHERE e.emp_id = p.emp_id AND e.dept_id = 2
);
-- Prepend EXPLAIN QUERY PLAN to see the plan
-- EXPLAIN QUERY PLAN SELECT COUNT(*) FROM perf_sales WHERE ...;
Let's get more concrete. How do IN and EXISTS behave depending on the size of the outer table and the size of the inner subquery and whether an index exists?
The two cases below — and which form is faster in each — are summarized in the diagram that follows.
-- Case A: outer perf_sales (50,000 rows), inner dept_id=1 employees (~5 rows, small and fixed)
-- IN form
SELECT * FROM perf_sales
WHERE emp_id IN (SELECT emp_id FROM employee WHERE dept_id = 1);
-- EXISTS form
SELECT * FROM perf_sales p
WHERE EXISTS (
SELECT 1 FROM employee e
WHERE e.emp_id = p.emp_id AND e.dept_id = 1
);
-- Case B: outer employee (30 rows), inner perf_sales (50,000 rows, with an index on emp_id)
-- IN form
SELECT * FROM employee e
WHERE e.emp_id IN (SELECT emp_id FROM perf_sales);
-- EXISTS form
SELECT * FROM employee e
WHERE EXISTS (
SELECT 1 FROM perf_sales p
WHERE p.emp_id = e.emp_id
);
Tip — modern databases show almost no real speed difference
We've sorted things into "IN tends to win / EXISTS tends to win", but that's a tendency rooted in older optimizers.
In modern databases like SQLite, PostgreSQL, and MySQL 8.0+, both IN (SELECT ...) and a correlated EXISTS get rewritten to the same "keep only rows that have a match" plan, and the actual plan and real-world speed end up roughly the same.
In other words, you can pick whichever reads better.
When the plan difference or speed difference actually matters, the rule is to measure — EXPLAIN QUERY PLAN and wall-clock time.
Rewriting a correlated subquery as JOIN + aggregation
A correlated subquery in the SELECT list (one re-evaluated per outer row) gets called more and more often as the row count grows.
In EXPLAIN QUERY PLAN it shows up as a CORRELATED SCALAR SUBQUERY line.
In many cases it can be rewritten as a single aggregation prepared in advance and joined in — the per-row re-evaluation goes away.
The rewrite pattern turns "for each outer row, call (SELECT ... WHERE child.key = parent.key)" into "build the aggregation table once with GROUP BY key, then JOIN it to the parent".
The result is the same and the correlation disappears from the plan.
-- Before: per-employee sales total via a correlated subquery
SELECT e.emp_id, e.name,
(SELECT SUM(p.amount) FROM perf_sales p
WHERE p.emp_id = e.emp_id) AS total
FROM employee e
ORDER BY e.emp_id;
-- After: build the aggregation once and JOIN
WITH sales_sum AS (
SELECT emp_id, SUM(amount) AS total
FROM perf_sales
GROUP BY emp_id
)
SELECT e.emp_id, e.name, s.total
FROM employee e
LEFT JOIN sales_sum s ON s.emp_id = e.emp_id
ORDER BY e.emp_id;
The NOT IN NULL pitfall — avoid it with NOT EXISTS
When you ask "no matching row" — an anti-join — NOT IN (SELECT ...) has a pitfall.
If even one NULL slips into the inner set, three-valued logic (TRUE / FALSE / UNKNOWN) makes the whole condition UNKNOWN, and the result comes back as 0 rows.
Meanwhile, NOT EXISTS (SELECT 1 ... WHERE ...) only asks whether the inner produces a row, so NULL doesn't break it and the anti-join works correctly.
When writing an anti-join, use NOT EXISTS — or, if you must use NOT IN, add WHERE col IS NOT NULL inside to strip NULLs out.
This course uses NOT EXISTS as the standard form.
-- Construct an example with NULL in the inner set, using team
-- team(mgr_emp) deliberately includes one NULL
WITH team(mgr_emp) AS (
VALUES (3), (7), (NULL)
)
-- Pitfall: with NULL present, NOT IN returns 0 rows
SELECT COUNT(*) FROM perf_sales
WHERE emp_id NOT IN (SELECT mgr_emp FROM team);
-- Fix: NOT EXISTS isn't affected by NULL, so the anti-join works
WITH team(mgr_emp) AS (
VALUES (3), (7), (NULL)
)
SELECT COUNT(*) FROM perf_sales p
WHERE NOT EXISTS (
SELECT 1 FROM team t WHERE t.mgr_emp = p.emp_id
);
Knowledge Check
Answer each question one by one.
Q2What's the typical rewrite for a correlated scalar subquery in the SELECT list (one that calls (SELECT SUM(...) WHERE child = parent) per row)?
Q3When the inner set might contain NULL, what's the safe way to write an anti-join (rows that don't match)?