Indexing

An index is a data structure that enables fast lookup of rows satisfying a query predicate. Without an index, a query must scan every row (sequential scan). With an index, it can directly locate the relevant rows.

Why Indexes Matter

Without index: SELECT * FROM Student WHERE email = 'alice@example.com' scans all $n$ rows. $O(n)$.

With index on email: lookup in $O(\log n)$ (B-tree) or $O(1)$ expected (hash).

Trade-off: indexes speed up reads but slow down writes (the index must be updated on every INSERT, UPDATE, DELETE). They also consume additional storage.

B-Tree Index

The default index type in PostgreSQL, MySQL, and most relational databases.

Structure: a balanced tree where:

Internal nodes store keys and pointers to children.
Leaf nodes store key-value pairs (key = indexed column, value = row identifier or the row itself).
Leaf nodes are linked in order (B+ tree variant).

Order $m$: each internal node holds up to $m-1$ keys and $m$ pointers. For disk-based B-trees, $m$ is chosen so that a node fits in a single disk page (e.g., 4 KB).

Properties:

All leaves are at the same depth.
Each internal node is at least half full.
Height: $O(\log_m n)$ for $n$ entries. With $m = 100$, a height-4 tree holds $100^4 = 100$ million entries.

Supported operations: equality (=), range (<, >, BETWEEN), prefix match (LIKE 'abc%'), ORDER BY, MIN/MAX.

Not useful for: suffix match (LIKE '%abc'), full-text search, similarity search.

Hash Index

Uses a hash table to map keys to row locations.

Lookup: $O(1)$ expected for equality queries.

Not useful for: range queries (keys are not ordered in the hash table), prefix match, ORDER BY.

PostgreSQL: supports hash indexes; less common in practice (B-tree covers most use cases).

In-memory databases: hash indexes are more attractive (no disk I/O amortization needed for B-tree).

Clustered vs. Non-Clustered Index

Clustered index (primary index): the table data is physically stored in the order of the index key. There can be only one clustered index per table.

MySQL InnoDB: the primary key is always the clustered index. Rows are stored in primary key order in the B+ tree leaf pages.
A lookup by primary key is efficient: no extra I/O to fetch the row.

Non-clustered index (secondary index): a separate B+ tree; leaf pages contain the index key and the row’s primary key (InnoDB) or heap file location (PostgreSQL). Fetching the actual row requires a second lookup.

Covering index: a non-clustered index that contains all columns needed by a query. The query can be answered entirely from the index without fetching rows (index-only scan).

Composite Index

An index on multiple columns: CREATE INDEX idx ON Student(dept_id, gpa).

Prefix rule: the composite index can be used for queries that filter on a prefix of the indexed columns.

WHERE dept_id = 3: uses the index (leftmost column).
WHERE dept_id = 3 AND gpa > 3.5: uses both columns.
WHERE gpa > 3.5: cannot use the index (gpa is not the leftmost column).

Column ordering: place the most selective (highest cardinality) columns first for equality predicates; range predicate column last.

Partial Index

An index on a subset of rows satisfying a predicate.

CREATE INDEX idx_active_users ON Users(email) WHERE status = 'active';

Smaller index; useful when queries always include the predicate.

Functional / Expression Index

Index on a computed expression.

CREATE INDEX idx_lower_email ON Users(LOWER(email));
-- Enables: WHERE LOWER(email) = 'alice@example.com'

Full-Text Index

Supports searching within text content. Tokenizes text into terms; builds an inverted index (term -> set of documents/rows).

PostgreSQL: CREATE INDEX idx_ft ON Articles USING GIN(to_tsvector('english', body)). Query: WHERE to_tsvector('english', body) @@ plainto_tsquery('database index').

Elasticsearch: purpose-built for full-text search. Stores inverted indexes; supports relevance scoring (TF-IDF, BM25), faceting, and aggregations.

Index Selection Guidelines

Create indexes on:

Primary keys (automatic).
Foreign key columns (speed up joins; PostgreSQL does not create these automatically).
Columns used in WHERE, JOIN ON, ORDER BY, GROUP BY clauses.
High-cardinality columns (many distinct values) for equality predicates.

Avoid indexes on:

Small tables (full scan is faster than index lookup + fetch).
Low-cardinality columns (e.g., boolean, gender) for equality predicates.
Columns that are updated very frequently.
Tables with heavy write workloads.

Query Optimizer and Index Use

The query optimizer decides whether to use an index based on estimated cost.

Statistics: the optimizer uses table statistics (row count, column histograms, index cardinality) to estimate the cost of different plans.

EXPLAIN / EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT * FROM Student WHERE dept_id = 3 AND gpa > 3.5;

Output shows the chosen plan, estimated vs. actual row counts, and execution times.

Common issues:

Outdated statistics: run ANALYZE tablename to refresh.
The optimizer may choose a sequential scan over an index scan if the predicate is not selective enough (e.g., matching 50% of rows).
Force index use (as a last resort): /*+ INDEX(Student idx_student_dept) */ (hint syntax varies by database).