Subqueries ① — Scalar, IN, and Aggregates in WHERE
Learn SQL subqueries from scratch: scalar subqueries, IN, aggregates in WHERE, correlated subqueries, and the NOT IN + NULL pitfall — hands-on in your browser.
A subquery is a SELECT statement written inside another query.
You can first compute "the average salary of all employees" and then use that result to narrow down to "employees who earn more than the average" — a step-by-step query written in a single statement.
In this article we'll work through the three forms used in the WHERE clause — scalar subqueries, IN subqueries, and aggregates inside WHERE (correlated subqueries) — one by one.
Before the exercises, let's look at the column definitions and a sample of the data in the employee table.
① Run PRAGMA table_info(employee); to check the column names, types, and primary key.
② Run SELECT * FROM employee LIMIT 5; to preview the first 5 rows. Note that some employees have NULL in the dept_id column — we'll deal with that in a later exercise.
SQL Editor
Run a query to see results
Schema
No tables
Scalar subquery — use a single-row, single-column result as a value
A scalar subquery (a subquery whose result is exactly one row and one column) returns a single value, so you can write it directly on the right-hand side of a comparison operator like = or >.
(SELECT AVG(salary) FROM employee) returns one value — "the average salary of all employees" — so WHERE salary > (SELECT AVG(salary) FROM employee) narrows down to "employees who earn more than the average."
The key points are to wrap it in parentheses and to make sure it always fits in one row and one column (returning multiple rows is an error).
Aggregate functions like AVG / MAX / MIN / COUNT return a single value, so they're commonly used in scalar subqueries.
How a scalar subquery flowsThe subquery first computes a single value, and that value is plugged into the comparison in the main query's WHERE. The subquery runs first, then the main query uses its result — a two-step structure.
-- Scalar subquery: get the employees who earn the maximum salarySELECTname, salaryFROM employeeWHERE salary = (SELECTMAX(salary) FROM employee);
Imagine the requirement "list the employees whose salary is higher than the average salary of all employees." (Run it correctly and the explanation appears.)
① Take the 2 columns name and salary from the employee table.
② In the WHERE condition, narrow down to rows where salary is greater than the average salary of all employees. Compute the average salary with a scalar subquery.
③ Sort by salary in descending order.
SQL Editor
Run a query to see results
Schema
No tables
IN subquery — filter by values that live in another table
When a subquery returns multiple rows in a single column, use IN instead of =.
Writing WHERE column IN (SELECT another_column FROM another_table ...) keeps only the rows whose value is in the set returned by the subquery.
The subquery on the right of IN SELECTs only one column (returning multiple columns is an error).
To negate it you use NOT IN, but there's a pitfall: if NULLs are mixed into the subquery side, the result changes (we cover this at the end of the article).
IN subquery — test membership in a setThe subquery builds a set of values, and the main query keeps only the rows that are in that set. Unlike a scalar subquery, IN is allowed to return multiple rows.
-- IN subquery: employees in a department located in TokyoSELECTname, dept_idFROM employeeWHERE dept_id IN (SELECT dept_id FROM department WHERElocation='Tokyo');
Imagine the requirement "list the employees who appear at least once in the sales records (employees who actually booked sales)."
① Take the 2 columns emp_id and name from the employee table.
② Build the set of employee IDs that appear in the sales table with a subquery, and use IN to narrow down to rows where emp_id is in that set.
③ Sort by emp_id in ascending order.
SQL Editor
Run a query to see results
Schema
No tables
Aggregates inside WHERE — compare against department averages with a correlated subquery
Before correlated subqueries, let's get the two-level structure of a subquery straight.
There's a main SELECT that pulls out the employee list, and inside it, in parentheses, you write another SELECT (the subquery).
The subqueries so far were non-correlated subqueries — computed just once at the start, independently of the main SELECT (for example, the company-wide average salary is the same single value no matter which employee you look at).
The two-level structure and non-correlationThe inner subquery is computed just once at the start, and the main SELECT uses that single value the same way for every employee. This is a non-correlated subquery.
By contrast, a subquery that uses a column of the employee the main SELECT is currently processing is called a correlated subquery (a subquery that is recomputed for each employee).
For example, to list "the top earner in the department (the employee whose salary equals their own department's maximum)," the maximum salary differs per department, so for each employee you need the maximum salary of just that department.
In the example below, the main employee is aliased emp and the subquery side dept_member, and WHERE dept_member.dept_id = emp.dept_id correlates them.
When the employee changes, the dept_id passed in also changes, and the result is recomputed.
-- Correlated subquery: employees whose salary equals their own department's max (the top earner in the department)SELECTemp.name, emp.dept_id, emp.salaryFROM employee empWHEREemp.salary= (SELECTMAX(dept_member.salary)FROM employee dept_memberWHEREdept_member.dept_id=emp.dept_id );
Correlated subquery — pass the main employee into the subqueryThis is the structure of the sample code above. From each employee in the main FROM employee emp, dept_id is passed into the subquery, which narrows down to rows in the same department with WHERE dept_member.dept_id = emp.dept_id, then computes that department's max salary MAX(dept_member.salary).
Imagine the requirement "within each department, list the employees whose salary is higher than that department's average salary."
① Alias the employee table as emp and take the 3 columns name, dept_id, and salary.
② Limit to rows where dept_id is not NULL (employees with no department are out of scope).
③ Narrow down to rows where salary is greater than the average salary of the same department as that employee. Compute the department average with a subquery aliased dept_member, correlated with dept_member.dept_id = emp.dept_id.
④ Sort by dept_id ascending, and within the same department by salary descending.
SQL Editor
Run a query to see results
Schema
No tables
The NOT IN and NULL pitfall
The negation of IN is NOT IN, but if NULL is included in the subquery on the right of NOT IN, the result changes drastically.
That's because when it becomes NOT IN (..., NULL), SQL checks "the value is not equal to any of them" in the form value <> ... AND value <> NULL, but value <> NULL resolves to neither true nor false (it's NULL), so the whole AND never becomes true and no rows survive.
Why NOT IN with a NULL mixed in returns 0 rowsNOT IN expands into a chain of inequalities joined by AND. The comparison against NULL, 'value <> NULL', resolves to neither true nor false, so the whole AND never becomes true and no rows survive.
Imagine the requirement "list the names of employees who aren't anyone's manager (boss)." The manager_id column in employee contains NULLs for employees who have no boss. (Run it correctly and the explanation appears.)
① Run the NOT IN in the editor as-is and confirm that the result comes out as 0 rows.
② Exclude the rows where manager_id is NULL on the subquery side, and fix it so the employees who aren't anyone's manager come out correctly.
③ Sort by emp_id in ascending order.
SQL Editor
Run a query to see results
Schema
No tables
QUIZ
Knowledge Check
Answer each question one by one.
Q1What is a scalar subquery (a subquery written on the right of a comparison operator) allowed to return?
Q2Which of the following correctly describes WHERE emp_id IN (SELECT emp_id FROM sales)?
Q3If you put a column that can contain NULL on the right of NOT IN, what tends to happen to the result?