Learn by reading through in order

Statistics and the Optimizer — ANALYZE and Selectivity

Walks through how the database decides whether to use an index or read every row, based on statistics gathered by ANALYZE. You'll see how high-selectivity and low-selectivity columns split the plan, with EXPLAIN QUERY PLAN as the witness.

The dataset for this article — perf_sales (50,000 rows of sales)

This article is about how the database decides which index to use.

The key piece is statistics.

Statistics are distribution facts about your data — "how many rows the table has", "how many distinct values a column holds" — and you gather them with the ANALYZE command.

From those statistics the database computes selectivity (the estimate of what fraction of rows survive a condition) and decides whether to use an index or read the whole table.

You'll work with the perf_sales sales table (50,000 rows).

It has columns with very different cardinalities: emp_id (30 distinct reps), region (4 values: East / West / North / South), product (200 values, P001 through P200), and status (mostly paid, plus pending and refunded).

You'll watch how the plan changes before and after ANALYZE.

Before the exercises, check the column definitions, a sample of the data, and how values are spread across each column in perf_sales. Data generation takes a moment, so the first run may pause for a few seconds.

① Use PRAGMA table_info(perf_sales); to view the column names, types, and primary key.

② Use SELECT * FROM perf_sales LIMIT 5; to preview the first 5 rows.

③ Count the rows per region value to see how the values are spread (cardinality).

SQL Editor

Run a query to see results

ANALYZE and sqlite_stat1 — gathering statistics

Run ANALYZE and the database walks the tables and indexes, gathers statistics, and saves them to an internal table called sqlite_stat1.

sqlite_stat1 records facts like "how many rows, on average, correspond to a single value of this index".

Running ANALYZE builds statistics grounded in your actual data, which helps the optimizer pick accurate execution plans.

Use ANALYZE on its own to cover every table, or ANALYZE table_name to target one specific table.

How ANALYZE gathers statistics and shifts the plan
ANALYZEwalks tables and indexesSaves statisticsto sqlite_stat1Optimizer estimatesselectivityPicks an index ora full scanNo statisticsfalls back torough defaultsWith statisticsestimates usereal data
ANALYZE walks the tables and indexes and saves statistics to sqlite_stat1. The optimizer reads those statistics to estimate selectivity and decides whether to use an index or scan the whole table.
-- Index two columns with different cardinalities (status=3 values / qty=20 values)
DROP INDEX IF EXISTS ix_status;
DROP INDEX IF EXISTS ix_qty;
CREATE INDEX ix_status ON perf_sales(status);
CREATE INDEX ix_qty    ON perf_sales(qty);

-- Gather statistics and look at what's stored (average rows per value)
ANALYZE;
SELECT tbl, idx, stat FROM sqlite_stat1
WHERE tbl = 'perf_sales' ORDER BY idx;
--> ix_status: a row like 50000 16667 (about 16667 rows per value)
--> ix_qty:    a row like 50000 2500 (about 2500 rows per value)

You'll actually gather the statistics that capture "how much each index can narrow down" and read the result. Do the index creation, statistics gathering, and verification in a single execution. (Run it correctly to reveal the explanation.)

① Drop any same-named indexes with DROP INDEX IF EXISTS, then use CREATE INDEX to build one index on emp_id and another on region.

② Run ANALYZE; to gather the statistics.

③ From sqlite_stat1, pull the rows where tbl is perf_sales and look at the stat column for each index (the total row count plus the average rows per value).

SQL Editor

Run a query to see results

Selectivity changes the plan — columns that pay off versus columns that don't

Selectivity is the estimate of "what fraction of all rows survives after a filter".

The smaller that fraction, the higher the selectivity and the more the index pays off.

emp_id = 7 narrows down to about 1,667 rows (around 3% of the table), so it has high selectivity and the index is worth using.

On the other hand, status = 'paid' leaves about 40,000 rows (around 80% of the table), so the selectivity is low and reading every row in order beats hopping through the index one row at a time.

Once statistics are in place, the optimizer (the query planner) reads the selectivity and compares how expensive an index plan would be versus a full scan.

With composite indexes or multiple candidates in play, it relies on the statistics to prefer "the index that narrows down to fewer rows".

The next exercise shows that even with the same table and the same indexes, the plan can change depending on the selectivity of the WHERE column and whether the columns you select are in the index.

The fork in the road: selectivity decides the index choice
High selectivityemp_id = 7Low selectivitystatus = 'paid'About 1667 rows left(about 3% of the table)About 40000 rows left(about 80% of the table)Index pays offSEARCH USING INDEXFull scan is cheaperplan becomes SCAN
A high-selectivity column (like emp_id, where one value leaves a few percent of rows) narrows down to a small set via the index, so the plan is SEARCH. A low-selectivity column (like status='paid', which leaves most rows) is cheaper to scan, so the plan becomes SCAN.
-- Index a high-selectivity column (amount: nearly unique) and a low-selectivity one (status: 3 values),
-- then after ANALYZE compare the plans for queries of the same shape
DROP INDEX IF EXISTS ix_amount;
DROP INDEX IF EXISTS ix_status;
CREATE INDEX ix_amount ON perf_sales(amount);
CREATE INDEX ix_status ON perf_sales(status);
ANALYZE;

EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE amount = 500000;
--> SEARCH perf_sales USING INDEX ix_amount (amount=?)

EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE status = 'paid';
--> low selectivity makes a SCAN-leaning plan a possibility

Compare "a condition almost no rows match" with "a condition almost every row matches", phrased as queries that select a column that's not in the index. With COUNT(*), SQLite can finish counting using the index alone, so the index gets used regardless of selectivity. To watch SCAN actually win on a low-selectivity filter, this exercise uses the SELECT sale_id, amount shape — pulling `amount`, which isn't in the index.

① Drop any same-named indexes with DROP INDEX IF EXISTS, build one index on emp_id and another on status, then run ANALYZE; to gather statistics.

② Add EXPLAIN QUERY PLAN and check the plan for the query that selects sale_id and amount for rows with emp_id = 7 (high selectivity: about 1,667 rows / 3%).

③ Then add EXPLAIN QUERY PLAN and check the plan for the same shape with status = 'paid', and read off how the emp_id side and the status side split between SEARCH and SCAN (low selectivity: about 40,000 rows / 80%).

SQL Editor

Run a query to see results

On this course's console, low selectivity can still keep an INDEX SCAN

On the browser console here (SQLite running in memory), step ③'s status = 'paid' plan may stay on SEARCH perf_sales USING INDEX ix_status (status=?). Here's why.

- It runs in memory: on a disk database, the gap between "random I/O through an index" and "a sequential SCAN" is large, which is why low selectivity favors SCAN. In memory, both are fast and the gap shrinks.

- Index entries are sorted by value: ix_status is ordered by status, so the 'paid' block is read as a contiguous range — there's no random access.

- Rowid-based table access is cheap: SQLite can fetch a table row directly via its internal rowid, so even index-driven table reads stay reasonably cheap.

The behavior of a clean switch to `SCAN` when selectivity drops is easier to observe on PostgreSQL or Oracle, RDBMSes with real disk-backed storage and a full cost-based optimizer. Take away from this exercise that there's a principle at play: even with identical indexes, the combination of selectivity and selected columns can flip the optimizer's choice. The exact threshold between SCAN and SEARCH moves with the engine, the presence of disk, and the precision of the statistics.

Cost-based optimization and bind variables — internals of other databases

The machinery you've seen — "estimate selectivity from statistics, then pick the cheapest plan" — is called a cost-based optimizer (CBO).

In large databases like Oracle and SQL Server, this machinery has more layers inside.

One is parsing (analyzing the SQL statement): analyzing a SQL statement for the first time and building a plan is a hard parse, while reusing a previously analyzed plan is a soft parse.

Another is bind variables (bind variable: a mechanism that doesn't put values directly in the SQL but leaves a slot like ? and passes the value at run time), together with bind peeking — peeking at the value the optimizer is given and picking a plan based on how skewed that value is.

These are internal mechanisms of large databases' execution engines. Knowing the concepts makes tuning discussions much easier to follow.

On this course's browser console, you can actually observe how the plan reacts to selectivity (the previous section), but the switch between hard and soft parses and the behavior of bind peeking can't be reproduced here as the callout below explains, so you'll work through the concepts with a diagram and a read-only code sample.

Cost-based optimization, parsing, and bind variables (conceptual)
Receive a SQL statementHard parseanalyze new SQL andbuild a fresh planSoft parsereuse a previouslybuilt planBind variablespass values into ? laterBind peekinglook at how skewed the value isand pick a planCost-based optimizationpick the cheapest plan from statistics
Inside an Oracle-style database: a hard parse builds a plan, and a soft parse reuses one. Bind variables pass values in later, and bind peeking picks a plan based on how skewed the passed value is. Shown here as a conceptual diagram.

Hard / soft parses and bind peeking are Oracle-style internals

The switch between hard and soft parses and bind peeking (re-picking the plan based on how skewed the passed value is) live inside the execution engines of large databases like Oracle and SQL Server.

This course's browser console doesn't expose the views needed to observe these internal states (think Oracle's V$SQL), so there's no way to demonstrate them in the console.

Here you'll get the concept from the diagram and the read-only code sample below.

On the other hand, the statistics gathering with ANALYZE and the way the plan reacts to selectivity, both covered earlier, are observable for real.

The heart of cost-based optimization — "statistics -> selectivity -> plan choice" — is something you can confirm hands-on in this course's console, so master that first.

-- An idea of how this looks in Oracle (read-only; don't run on this course's console)
-- Bind variables pass the value in later (:s gets filled at run time)
-- SELECT * FROM perf_sales WHERE status = :s;
--
-- Identical SQL strings reuse the plan = soft parse
-- Even slightly different strings rebuild it = hard parse
-- In Oracle you'd inspect parsing via V$SQL.SQL_TEXT, but we won't here

-- What you can actually observe on this course's console:
-- once statistics are gathered, the selectivity estimate is grounded in real data
ANALYZE;
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM perf_sales WHERE status = 'pending';
--> SEARCH perf_sales USING INDEX ... (status=?)

Of cost-based optimization, the part this course's console actually lets you watch is "gathering statistics shifts the estimate", and you'll verify it on the product column (200 values, high selectivity). Do the index creation and the plan checks before and after gathering statistics in a single execution.

① Drop any same-named index with DROP INDEX IF EXISTS, then create an index on the product column.

② Add EXPLAIN QUERY PLAN and check the plan for a query that counts rows where product is 'P050', with no statistics in place.

③ Run ANALYZE; to gather statistics, then run the same query's EXPLAIN QUERY PLAN again. Since product has 200 values and high selectivity, the plan should keep using the index after ANALYZE too.

SQL Editor

Run a query to see results
QUIZ

Knowledge Check

Answer each question one by one.

Q1Which of the following best describes the purpose of running ANALYZE?

Q2Which of the following correctly describes selectivity?

Q3Which of the following correctly describes the difference between a hard parse and a soft parse (an internal mechanism in large databases like Oracle)?