Q1Which of the following best describes the purpose of running ANALYZE?
Statistics and the Optimizer — ANALYZE and Selectivity
Walks through how the database decides whether to use an index or read every row, based on statistics gathered by ANALYZE. You'll see how high-selectivity and low-selectivity columns split the plan, with EXPLAIN QUERY PLAN as the witness.
The dataset for this article — perf_sales (50,000 rows of sales)
This article is about how the database decides which index to use.
The key piece is statistics.
Statistics are distribution facts about your data — "how many rows the table has", "how many distinct values a column holds" — and you gather them with the ANALYZE command.
From those statistics the database computes selectivity (the estimate of what fraction of rows survive a condition) and decides whether to use an index or read the whole table.
You'll work with the perf_sales sales table (50,000 rows).
It has columns with very different cardinalities: emp_id (30 distinct reps), region (4 values: East / West / North / South), product (200 values, P001 through P200), and status (mostly paid, plus pending and refunded).
You'll watch how the plan changes before and after ANALYZE.
ANALYZE and sqlite_stat1 — gathering statistics
Run ANALYZE and the database walks the tables and indexes, gathers statistics, and saves them to an internal table called sqlite_stat1.
sqlite_stat1 records facts like "how many rows, on average, correspond to a single value of this index".
Running ANALYZE builds statistics grounded in your actual data, which helps the optimizer pick accurate execution plans.
Use ANALYZE on its own to cover every table, or ANALYZE table_name to target one specific table.
-- Index two columns with different cardinalities (status=3 values / qty=20 values)
DROP INDEX IF EXISTS ix_status;
DROP INDEX IF EXISTS ix_qty;
CREATE INDEX ix_status ON perf_sales(status);
CREATE INDEX ix_qty ON perf_sales(qty);
-- Gather statistics and look at what's stored (average rows per value)
ANALYZE;
SELECT tbl, idx, stat FROM sqlite_stat1
WHERE tbl = 'perf_sales' ORDER BY idx;
--> ix_status: a row like 50000 16667 (about 16667 rows per value)
--> ix_qty: a row like 50000 2500 (about 2500 rows per value)
Selectivity changes the plan — columns that pay off versus columns that don't
Selectivity is the estimate of "what fraction of all rows survives after a filter".
The smaller that fraction, the higher the selectivity and the more the index pays off.
emp_id = 7 narrows down to about 1,667 rows (around 3% of the table), so it has high selectivity and the index is worth using.
On the other hand, status = 'paid' leaves about 40,000 rows (around 80% of the table), so the selectivity is low and reading every row in order beats hopping through the index one row at a time.
Once statistics are in place, the optimizer (the query planner) reads the selectivity and compares how expensive an index plan would be versus a full scan.
With composite indexes or multiple candidates in play, it relies on the statistics to prefer "the index that narrows down to fewer rows".
The next exercise shows that even with the same table and the same indexes, the plan can change depending on the selectivity of the WHERE column and whether the columns you select are in the index.
-- Index a high-selectivity column (amount: nearly unique) and a low-selectivity one (status: 3 values),
-- then after ANALYZE compare the plans for queries of the same shape
DROP INDEX IF EXISTS ix_amount;
DROP INDEX IF EXISTS ix_status;
CREATE INDEX ix_amount ON perf_sales(amount);
CREATE INDEX ix_status ON perf_sales(status);
ANALYZE;
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE amount = 500000;
--> SEARCH perf_sales USING INDEX ix_amount (amount=?)
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE status = 'paid';
--> low selectivity makes a SCAN-leaning plan a possibility
On this course's console, low selectivity can still keep an INDEX SCAN
On the browser console here (SQLite running in memory), step ③'s status = 'paid' plan may stay on SEARCH perf_sales USING INDEX ix_status (status=?). Here's why.
- It runs in memory: on a disk database, the gap between "random I/O through an index" and "a sequential SCAN" is large, which is why low selectivity favors SCAN. In memory, both are fast and the gap shrinks.
- Index entries are sorted by value: ix_status is ordered by status, so the 'paid' block is read as a contiguous range — there's no random access.
- Rowid-based table access is cheap: SQLite can fetch a table row directly via its internal rowid, so even index-driven table reads stay reasonably cheap.
The behavior of a clean switch to `SCAN` when selectivity drops is easier to observe on PostgreSQL or Oracle, RDBMSes with real disk-backed storage and a full cost-based optimizer. Take away from this exercise that there's a principle at play: even with identical indexes, the combination of selectivity and selected columns can flip the optimizer's choice. The exact threshold between SCAN and SEARCH moves with the engine, the presence of disk, and the precision of the statistics.
Cost-based optimization and bind variables — internals of other databases
The machinery you've seen — "estimate selectivity from statistics, then pick the cheapest plan" — is called a cost-based optimizer (CBO).
In large databases like Oracle and SQL Server, this machinery has more layers inside.
One is parsing (analyzing the SQL statement): analyzing a SQL statement for the first time and building a plan is a hard parse, while reusing a previously analyzed plan is a soft parse.
Another is bind variables (bind variable: a mechanism that doesn't put values directly in the SQL but leaves a slot like ? and passes the value at run time), together with bind peeking — peeking at the value the optimizer is given and picking a plan based on how skewed that value is.
These are internal mechanisms of large databases' execution engines. Knowing the concepts makes tuning discussions much easier to follow.
On this course's browser console, you can actually observe how the plan reacts to selectivity (the previous section), but the switch between hard and soft parses and the behavior of bind peeking can't be reproduced here as the callout below explains, so you'll work through the concepts with a diagram and a read-only code sample.
Hard / soft parses and bind peeking are Oracle-style internals
The switch between hard and soft parses and bind peeking (re-picking the plan based on how skewed the passed value is) live inside the execution engines of large databases like Oracle and SQL Server.
This course's browser console doesn't expose the views needed to observe these internal states (think Oracle's V$SQL), so there's no way to demonstrate them in the console.
Here you'll get the concept from the diagram and the read-only code sample below.
On the other hand, the statistics gathering with ANALYZE and the way the plan reacts to selectivity, both covered earlier, are observable for real.
The heart of cost-based optimization — "statistics -> selectivity -> plan choice" — is something you can confirm hands-on in this course's console, so master that first.
-- An idea of how this looks in Oracle (read-only; don't run on this course's console)
-- Bind variables pass the value in later (:s gets filled at run time)
-- SELECT * FROM perf_sales WHERE status = :s;
--
-- Identical SQL strings reuse the plan = soft parse
-- Even slightly different strings rebuild it = hard parse
-- In Oracle you'd inspect parsing via V$SQL.SQL_TEXT, but we won't here
-- What you can actually observe on this course's console:
-- once statistics are gathered, the selectivity estimate is grounded in real data
ANALYZE;
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE status = 'pending';
--> SEARCH perf_sales USING INDEX ... (status=?)
Knowledge Check
Answer each question one by one.
Q2Which of the following correctly describes selectivity?
Q3Which of the following correctly describes the difference between a hard parse and a soft parse (an internal mechanism in large databases like Oracle)?