Index Optimization Strategies
Understanding when and how to use indexes effectively is crucial for database performance. In this lesson, we'll explore advanced indexing concepts including selectivity, the leftmost prefix rule, index hints, and strategies to avoid common pitfalls.
Index Selectivity
Selectivity measures how unique the values in an indexed column are. The higher the selectivity, the fewer rows each index lookup has to touch, and the more useful the index becomes.
Selectivity Formula:
Selectivity = COUNT(DISTINCT column) / COUNT(column)
Selectivity Range:
1.0 = Perfect (all values unique - best for indexing)
0.5 = Moderate (50% unique values)
0.0 = Poor (all values the same - worst for indexing)
Calculating Selectivity
-- Check selectivity of different columns in users table
SELECT
'email' as column_name,
COUNT(DISTINCT email) / COUNT(*) as selectivity
FROM users;
-- Result: 0.98 (98% unique - EXCELLENT for indexing)
SELECT
'country' as column_name,
COUNT(DISTINCT country) / COUNT(*) as selectivity
FROM users;
-- Result: 0.0004 (0.04% unique - POOR for indexing)
-- Only 200 countries but 500,000 users
SELECT
'gender' as column_name,
COUNT(DISTINCT gender) / COUNT(*) as selectivity
FROM users;
-- Result: 0.000004 (only 2 values: M/F - VERY POOR for indexing)
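Rule of Thumb: As a rough guide, standalone indexes pay off on columns with selectivity above 0.1 (10% unique). What ultimately matters is how many rows a typical lookup matches, so treat the threshold as a starting point rather than a hard limit. Columns with low selectivity, such as gender, boolean flags, or status fields with few options, are poor candidates for standalone indexes.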
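The per-column checks above can be folded into a single pass over the table. This is a minimal sketch, assuming the users table has the email, country, and gender columns used earlier; the output aliases (such as avg_rows_per_country) are illustrative names only.

```sql
-- One scan of users, one selectivity figure per column.
-- COUNT(DISTINCT col) ignores NULLs, so NULL-heavy columns look less selective.
SELECT
    COUNT(*)                           AS total_rows,
    COUNT(DISTINCT email)   / COUNT(*) AS email_selectivity,
    COUNT(DISTINCT country) / COUNT(*) AS country_selectivity,
    COUNT(DISTINCT gender)  / COUNT(*) AS gender_selectivity,
    -- Average number of rows matched by a single country value;
    -- often easier to reason about than the raw ratio
    COUNT(*) / COUNT(DISTINCT country) AS avg_rows_per_country
FROM users;
```

The rows-per-value figure is a useful companion to the ratio: a column can have a tiny ratio on a very large table and still narrow a lookup to a manageable number of rows.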
When to Index Low-Selectivity Columns
-- Don't create standalone index on low-selectivity columns
-- BAD:
CREATE INDEX idx_gender ON users(gender); -- Only 2 values!
-- Good: Use as part of composite index
CREATE INDEX idx_gender_created ON users(gender, created_at);
-- This query benefits:
SELECT * FROM users WHERE gender = 'F' ORDER BY created_at DESC;
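-- The equality on gender narrows the scan to about half of the index; within that range
-- the entries are already ordered by created_at, so no separate sort is needed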
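Whether the optimizer actually picks the composite index is worth confirming with EXPLAIN. The sketch below assumes idx_gender_created exists as defined above; the LIMIT 20 is an added assumption, since without a limit the optimizer may decide that reading half the table through a secondary index costs more than a full scan.

```sql
EXPLAIN
SELECT * FROM users
WHERE gender = 'F'
ORDER BY created_at DESC
LIMIT 20;
-- Things to look for in the output:
--   key   = idx_gender_created
--   Extra = no "Using filesort" (rows come back in index order)
```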
The Leftmost Prefix Rule
Understanding the leftmost prefix rule is critical for composite indexes:
-- Create composite index
CREATE INDEX idx_abc ON orders(customer_id, status, created_at);
This index can be used for:
✓ WHERE customer_id = 123
✓ WHERE customer_id = 123 AND status = 'pending'
✓ WHERE customer_id = 123 AND status = 'pending' AND created_at > '2024-01-01'
✓ WHERE customer_id = 123 ORDER BY status
✓ WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at
This index CANNOT be used for:
✗ WHERE status = 'pending' (skips customer_id)
✗ WHERE created_at > '2024-01-01' (skips customer_id)
✗ WHERE status = 'pending' AND created_at > '2024-01-01' (skips customer_id)
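The rule can be checked directly with EXPLAIN. This is a sketch on a hypothetical orders_demo table (not part of the lesson's schema) that carries the same three-column index; the extra total column is there so that SELECT * cannot be answered from the index alone. Plans on an empty table are not representative, so load realistic data before drawing conclusions.

```sql
-- Hypothetical table with the same composite index as above
CREATE TABLE orders_demo (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  DATETIME NOT NULL,
    total       DECIMAL(10,2) NOT NULL,
    INDEX idx_abc (customer_id, status, created_at)
);

-- Uses the leftmost two columns: look for key = idx_abc and a key_len
-- that spans customer_id and status
EXPLAIN
SELECT * FROM orders_demo
WHERE customer_id = 123 AND status = 'pending';

-- Skips customer_id: look for key = NULL and type = ALL
EXPLAIN
SELECT * FROM orders_demo
WHERE status = 'pending';
```

Newer MySQL 8.0 releases have a skip scan access method that can use an index without its leading column in some narrow cases, so the second plan may occasionally differ; the rule is still the safe basis for index design.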
Optimizing Composite Index Order
Scenario: E-commerce orders table with these queries
Q1: SELECT * FROM orders WHERE customer_id = 123; (90% of queries)
Q2: SELECT * FROM orders WHERE status = 'pending'; (5% of queries)
Q3: SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending'; (5%)
Option 1: customer_id first (RECOMMENDED)
CREATE INDEX idx_customer_status ON orders(customer_id, status);
✓ Q1: Uses index (customer_id) - covers 90% of queries
✓ Q3: Uses full index (customer_id, status) - covers 5%
✗ Q2: Can't use index efficiently
Option 2: status first (NOT RECOMMENDED)
CREATE INDEX idx_status_customer ON orders(status, customer_id);
✓ Q2: Uses index (status) - covers 5% of queries
✓ Q3: Uses full index (status, customer_id) - covers 5%
✗ Q1: Can't use index efficiently - but this is 90%!
Conclusion: Always optimize for the most frequent queries!
Best Practice: In composite indexes, place the most selective and frequently queried column first. Consider query frequency and cardinality together.
Index Hints
Sometimes MySQL's optimizer chooses the wrong index. Index hints let you suggest, force, or exclude specific indexes:
USE INDEX Hint
-- Restrict the optimizer to the listed index (it may still choose a table scan instead)
SELECT * FROM orders USE INDEX (idx_customer_status)
WHERE customer_id = 123 AND status = 'pending';
-- Multiple index suggestions
SELECT * FROM orders USE INDEX (idx_customer, idx_status)
WHERE customer_id = 123;
FORCE INDEX Hint
-- Like USE INDEX, but a table scan is treated as very expensive,
-- so it is chosen only when the listed index cannot be used at all
SELECT * FROM orders FORCE INDEX (idx_customer_status)
WHERE customer_id = 123 AND status = 'pending';
-- Force specific index for JOIN
SELECT o.*, u.name
FROM orders o FORCE INDEX (idx_customer)
JOIN users u ON o.customer_id = u.id
WHERE o.status = 'pending';
IGNORE INDEX Hint
-- Prevent MySQL from using a specific index
SELECT * FROM orders IGNORE INDEX (idx_status)
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Forces MySQL to consider other indexes or full table scan
Warning: Use index hints sparingly! They override the optimizer and may become problematic as data changes. Only use when you've proven the optimizer makes a poor choice consistently.
When to Use Index Hints
Valid Use Cases:
✓ Optimizer consistently chooses wrong index
✓ You've tested and confirmed better performance
✓ Data distribution makes optimizer's statistics inaccurate
✓ Complex queries where optimizer struggles
Don't Use When:
✗ You haven't run EXPLAIN to verify the problem
✗ As a "quick fix" instead of proper optimization
✗ You're guessing which index is better
✗ Data patterns change frequently
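Before committing to a hint, it helps to rule out stale statistics and compare plans side by side. The sequence below is a sketch of that workflow, assuming the orders table and idx_customer_status index from the examples above.

```sql
-- 1. Refresh index statistics; stale statistics are a common cause of bad plans
ANALYZE TABLE orders;

-- 2. See what the optimizer picks on its own
EXPLAIN
SELECT * FROM orders
WHERE customer_id = 123 AND status = 'pending';

-- 3. Compare against the hinted plan
EXPLAIN
SELECT * FROM orders FORCE INDEX (idx_customer_status)
WHERE customer_id = 123 AND status = 'pending';

-- 4. MySQL 8.0.18+: EXPLAIN ANALYZE executes the query and reports actual timings
EXPLAIN ANALYZE
SELECT * FROM orders FORCE INDEX (idx_customer_status)
WHERE customer_id = 123 AND status = 'pending';
```

If step 1 alone changes the plan in step 2, the hint is probably unnecessary.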
Avoiding Full Table Scans
On large tables, full table scans (type = ALL in EXPLAIN output) are one of the most common causes of slow queries. Here's how to avoid them:
Common Causes and Fixes
Cause 1: No Index Exists
-- Problem
SELECT * FROM users WHERE country = 'USA';
-- EXPLAIN shows: type=ALL, key=NULL
-- Fix (helps most when the searched value matches a small share of the rows)
CREATE INDEX idx_country ON users(country);
Cause 2: Function on Indexed Column
-- Problem
SELECT * FROM users WHERE YEAR(created_at) = 2024;
-- Function prevents index usage
-- Fix
SELECT * FROM users
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
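When the query text cannot be rewritten, a functional index is another option. This is a sketch that assumes MySQL 8.0.13 or later and that created_at is a DATETIME column; idx_created_year is a hypothetical name.

```sql
-- Index the expression itself (note the double parentheses)
CREATE INDEX idx_created_year ON users ((YEAR(created_at)));

-- The WHERE clause must repeat the indexed expression exactly
EXPLAIN
SELECT * FROM users WHERE YEAR(created_at) = 2024;
```

The range rewrite shown above is usually the better first choice, because it reuses an ordinary index on created_at that other queries can share.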
Cause 3: Implicit Type Conversion
-- Problem (email is VARCHAR, but using integer)
SELECT * FROM users WHERE email = 123456;
-- MySQL converts email to number, preventing index usage
-- Fix (use correct data type)
SELECT * FROM users WHERE email = '123456';
Cause 4: OR with Non-Indexed Column
-- Problem
SELECT * FROM users WHERE email = 'john@example.com' OR phone = '555-1234';
-- If phone has no index, entire query becomes table scan
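-- Fix: Create an index on phone so both sides of the OR can use an index
CREATE INDEX idx_phone ON users(phone);
-- With both columns indexed, MySQL can apply index merge (union);
-- if it does not, rewrite the OR as a UNION so each branch uses its own index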
SELECT * FROM users WHERE email = 'john@example.com'
UNION
SELECT * FROM users WHERE phone = '555-1234';
Cause 5: Wildcard at Start
-- Problem
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Leading wildcard prevents index usage
-- Fix: Avoid leading wildcards when possible
-- (a prefix search like the one below answers a different question than a suffix search)
SELECT * FROM users WHERE email LIKE 'john%'; -- Can use index
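If the suffix search itself is the requirement, one possible approach is to store the searched part in its own indexed column. The sketch below assumes MySQL 5.7 or later and well-formed addresses with a single @; email_domain and idx_email_domain are hypothetical names, and VARCHAR(255) is an assumed size.

```sql
-- Stored generated column holding everything after the last '@'
ALTER TABLE users
    ADD COLUMN email_domain VARCHAR(255)
        GENERATED ALWAYS AS (SUBSTRING_INDEX(email, '@', -1)) STORED;

CREATE INDEX idx_email_domain ON users(email_domain);

-- Equality lookup on the new column can use the index
SELECT * FROM users WHERE email_domain = 'example.com';
```

This trades extra storage and write cost for an indexable lookup, so it only makes sense when the suffix search is frequent.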
Index Merge Optimization
MySQL can combine multiple indexes for a single query:
-- Two separate indexes
CREATE INDEX idx_country ON users(country);
CREATE INDEX idx_age ON users(age);
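-- Query using both conditions (an equality on each indexed column)
SELECT * FROM users WHERE country = 'USA' AND age = 25;
-- Note: the intersection algorithm needs equality conditions on secondary indexes;
-- a range such as age > 25 does not qualify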
-- EXPLAIN may show:
type: index_merge
key: idx_country,idx_age
Extra: Using intersect(idx_country,idx_age); Using where
Index Merge Types:
- intersection: Returns rows found in every index (AND conditions)
- union: Combines rows found in any of the indexes (OR conditions)
- sort-union: Like union, but row IDs are sorted before the rows are fetched (used for OR-ed range conditions)
Note: While index merge is better than full table scan, a single composite index is usually more efficient than merging multiple indexes.
-- Better solution:
CREATE INDEX idx_country_age ON users(country, age);
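A composite index also handles a range on its trailing column, which index merge intersection does not. The sketch below assumes idx_country_age has been created as above and checks the plan before retiring the index it makes redundant.

```sql
-- Equality on country plus a range on age: look for key = idx_country_age, type = range
EXPLAIN
SELECT * FROM users WHERE country = 'USA' AND age > 25;

-- idx_country is now a leftmost prefix of idx_country_age and can be dropped.
-- idx_age is NOT covered by the composite; keep it if queries filter on age alone.
ALTER TABLE users DROP INDEX idx_country;
```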
When Indexes Hurt Performance
Too many indexes or poorly chosen indexes can degrade performance:
1. Write-Heavy Tables
-- Table with many indexes
CREATE TABLE logs (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT,
action VARCHAR(50),
created_at TIMESTAMP,
INDEX idx_user (user_id),
INDEX idx_action (action),
INDEX idx_created (created_at),
INDEX idx_user_action (user_id, action),
INDEX idx_user_created (user_id, created_at)
);
-- Problem: Every INSERT updates 6 indexes (the primary key plus 5 secondary indexes)!
INSERT INTO logs (user_id, action, created_at) VALUES (123, 'login', NOW());
-- This is SLOW for high-volume logging
-- Solution: Reduce indexes for write-heavy tables
-- Keep only essential indexes, consider separate logging system
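A first trimming step for the logs table above might look like the following sketch. Only the index that is provably redundant is dropped; which of the remaining indexes are essential depends on the actual read queries, which the lesson does not specify.

```sql
-- idx_user (user_id) is a leftmost prefix of both idx_user_action and
-- idx_user_created, so lookups by user_id alone remain indexed without it
ALTER TABLE logs DROP INDEX idx_user;

-- Review what is left before removing anything else
SHOW INDEX FROM logs;
```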
2. Redundant Indexes
-- Redundant indexes waste space and slow writes
CREATE INDEX idx_email ON users(email);
CREATE INDEX idx_email_name ON users(email, name);
-- idx_email is redundant! idx_email_name covers it due to leftmost prefix
-- Also redundant (my_table is a placeholder name):
CREATE INDEX idx_a ON my_table(a);
CREATE INDEX idx_ab ON my_table(a, b);
CREATE INDEX idx_abc ON my_table(a, b, c);
-- In most cases only idx_abc is needed: it serves lookups on (a), (a, b), and (a, b, c).
-- List every index with its columns to spot redundant prefixes by eye
SELECT
table_name,
index_name,
GROUP_CONCAT(column_name ORDER BY seq_in_index) as columns
FROM information_schema.STATISTICS
WHERE table_schema = 'your_database'
GROUP BY table_name, index_name
ORDER BY table_name, index_name;
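The listing above still leaves the comparison to the reader. One way to automate it is to join the listing to itself and flag any index whose column list is a strict leftmost prefix of another index on the same table. This is a sketch assuming MySQL 8.0 or later (for the WITH clause); the CTE name idx and the aliases narrow and wide are illustrative.

```sql
WITH idx AS (
    SELECT
        TABLE_NAME      AS tbl,
        INDEX_NAME      AS idx_name,
        MAX(NON_UNIQUE) AS non_unique,
        GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS cols
    FROM information_schema.STATISTICS
    WHERE TABLE_SCHEMA = 'your_database'
    GROUP BY TABLE_NAME, INDEX_NAME
)
SELECT
    narrow.tbl,
    narrow.idx_name AS possibly_redundant,
    narrow.cols     AS its_columns,
    wide.idx_name   AS covered_by,
    wide.cols       AS covering_columns
FROM idx AS narrow
JOIN idx AS wide
  ON  wide.tbl = narrow.tbl
  AND wide.idx_name <> narrow.idx_name
  -- wide's column list starts with narrow's full list followed by a comma
  AND LEFT(wide.cols, CHAR_LENGTH(narrow.cols) + 1) = CONCAT(narrow.cols, ',')
-- Skip PRIMARY and UNIQUE indexes: they enforce constraints, not just lookups
WHERE narrow.non_unique = 1
ORDER BY narrow.tbl, narrow.idx_name;
```

The comparison is purely textual, so it ignores prefix lengths, index types such as FULLTEXT, and functional index parts. Treat the output as a list of candidates to review, not a drop list.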
3. Over-Indexing Small Tables
-- Don't index tiny tables
CREATE TABLE settings (
id INT PRIMARY KEY,
key_name VARCHAR(50),
value TEXT
); -- Only 10 rows
-- Adding indexes here is wasteful
-- Full table scan of 10 rows is faster than index lookup!
Practice Exercise:
Scenario: You have a social media platform with a posts table:
CREATE TABLE posts (
id INT PRIMARY KEY,
user_id INT,
content TEXT,
category VARCHAR(50),
is_published BOOLEAN,
likes_count INT,
created_at TIMESTAMP
);
-- 10 million posts
-- Query patterns (with frequency):
Q1 (60%): SELECT * FROM posts
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 20;
Q2 (30%): SELECT * FROM posts
WHERE category = ? AND is_published = 1
ORDER BY likes_count DESC
LIMIT 10;
Q3 (8%): SELECT * FROM posts
WHERE is_published = 1
ORDER BY created_at DESC
LIMIT 50;
Q4 (2%): SELECT * FROM posts
WHERE user_id = ? AND category = ?
ORDER BY created_at DESC;
Task: Design an optimal indexing strategy.
Solution:
Step 1: Analyze selectivity
SELECT COUNT(DISTINCT user_id) / COUNT(*) FROM posts;
-- Result: 0.001 (10,000 users, 10M posts) - Low as a ratio, but each user_id matches only about 1,000 rows
SELECT COUNT(DISTINCT category) / COUNT(*) FROM posts;
-- Result: 0.00001 (100 categories, 10M posts) - Low
SELECT COUNT(DISTINCT is_published) / COUNT(*) FROM posts;
-- Result: 0.0000002 (2 values) - Very Low
Step 2: Design indexes based on query frequency
-- Index 1: For Q1 (60% of queries) - highest priority
CREATE INDEX idx_user_created ON posts(user_id, created_at);
-- Covers user filter + date sorting
-- Index 2: For Q2 (30% of queries)
CREATE INDEX idx_cat_pub_likes ON posts(category, is_published, likes_count);
-- Covers category + published filter + likes sorting
-- Index 3: For Q3 (8% of queries)
CREATE INDEX idx_published_created ON posts(is_published, created_at);
-- Covers published filter + date sorting
-- Q4 (2%): Can reuse idx_user_created
-- WHERE user_id = ? AND category = ?
-- idx_user_created covers user_id efficiently
-- MySQL reads that user's rows through the index and filters on category afterwards (acceptable for 2% of queries)
Rationale:
- Optimized for most frequent queries (90% covered by first two indexes)
- Avoided redundancy (Q4 can use existing index)
- Put equality-filter columns first in each composite, ordered by selectivity where there is a choice
- Placed the ORDER BY column last so MySQL can return rows in index order and skip a filesort
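The design can be checked against the two most frequent queries with EXPLAIN. The sketch below assumes the three indexes from Step 2 exist and the table holds representative data; the literal values 42 and 'tech' stand in for the ? placeholders.

```sql
-- Q1: look for key = idx_user_created and no "Using filesort" in Extra
EXPLAIN
SELECT * FROM posts
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- Q2: look for key = idx_cat_pub_likes and no "Using filesort" in Extra
EXPLAIN
SELECT * FROM posts
WHERE category = 'tech' AND is_published = 1
ORDER BY likes_count DESC
LIMIT 10;
```

Both queries sort in descending order on the last index column, which MySQL can satisfy by reading the index backward.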
Index Monitoring and Maintenance
-- Find unused indexes (sys schema, MySQL 5.7+; usage counts reset when the server restarts)
SELECT * FROM sys.schema_unused_indexes;
-- Find duplicate/redundant indexes
SELECT * FROM sys.schema_redundant_indexes;
-- Refresh the index statistics used by the optimizer (this does not measure or fix fragmentation)
ANALYZE TABLE users;
-- Rebuild fragmented indexes
ALTER TABLE users ENGINE=InnoDB; -- Rebuilds all indexes
-- Or rebuild specific index
DROP INDEX idx_email ON users;
CREATE INDEX idx_email ON users(email);
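Dropping an index that turns out to be needed can be expensive to undo on a large table. A more cautious route, sketched here on the assumption of MySQL 8.0 or later, is to hide the index from the optimizer first and drop it only after a quiet observation period.

```sql
-- Hide the index from the optimizer; it is still maintained on every write
ALTER TABLE users ALTER INDEX idx_email INVISIBLE;

-- ...monitor query performance for a representative period...

-- If something slowed down, make it visible again (no rebuild needed)
ALTER TABLE users ALTER INDEX idx_email VISIBLE;

-- Otherwise, remove it for good
DROP INDEX idx_email ON users;
```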
Summary
In this lesson, you learned:
- Index selectivity measures uniqueness - aim for >0.1 (10%+ unique values)
- The leftmost prefix rule determines which queries can use composite indexes
- Order composite indexes by query frequency and selectivity
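- Use index hints (USE INDEX, FORCE INDEX, IGNORE INDEX) sparingly, and only after EXPLAIN confirms the optimizer's choice is poor
- Avoid full table scans by removing functions on indexed columns, implicit type conversions, and leading wildcards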
- Index merge combines multiple indexes but a composite index is usually better
- Too many indexes hurt write performance - remove redundant indexes
- Very small tables (such as a 10-row settings table) rarely benefit from additional indexes
- Monitor index usage and remove unused indexes regularly
Next Up: In the next lesson, we'll explore full-text search with FULLTEXT indexes and the MATCH AGAINST syntax for powerful text searching!