We are still cooking the magic in the way!
MySQL & Database Design
Aggregate Functions & GROUP BY Mastery
Aggregate Functions & GROUP BY Mastery
Aggregate functions combined with GROUP BY are the foundation of data analysis and reporting in SQL. They allow you to summarize, analyze, and derive insights from large datasets. Mastering these techniques will enable you to create powerful analytics queries and comprehensive reports.
Core Aggregate Functions Review
COUNT(): Count rows or non-NULL values
SUM(): Total of numeric values
AVG(): Average of numeric values
MIN(): Minimum value
MAX(): Maximum value
GROUP_CONCAT(): Concatenate values into string (MySQL-specific)
GROUP BY with Multiple Columns
Group by multiple columns to create multi-dimensional aggregations:
-- Sales by region and category
SELECT
region,
category,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value,
MIN(order_date) AS first_order,
MAX(order_date) AS last_order
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY region, category
ORDER BY region, total_revenue DESC;
-- Monthly sales by product category
SELECT
YEAR(order_date) AS year,
MONTH(order_date) AS month,
category,
COUNT(DISTINCT order_id) AS orders,
SUM(quantity * unit_price) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
JOIN order_items ON orders.order_id = order_items.order_id
JOIN products ON order_items.product_id = products.product_id
GROUP BY YEAR(order_date), MONTH(order_date), category
ORDER BY year DESC, month DESC, revenue DESC;
ROLLUP - Creating Subtotals and Grand Totals
ROLLUP generates hierarchical subtotals and grand totals:
-- Sales with subtotals by region and category
SELECT
region,
category,
SUM(total_amount) AS total_sales,
COUNT(*) AS order_count
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY region, category WITH ROLLUP;
-- Result includes:
-- 1. Individual region + category combinations
-- 2. Subtotals for each region (category = NULL)
-- 3. Grand total (region = NULL, category = NULL)
-- Multi-level rollup with COALESCE for readability
SELECT
COALESCE(region, 'TOTAL') AS region,
COALESCE(category, 'All Categories') AS category,
COUNT(*) AS orders,
SUM(total_amount) AS revenue,
ROUND(AVG(total_amount), 2) AS avg_order_value
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY region, category WITH ROLLUP;
Understanding ROLLUP: With GROUP BY col1, col2 WITH ROLLUP, MySQL creates aggregations for (col1, col2), (col1, NULL), and (NULL, NULL). This gives you subtotals at each hierarchy level.
GROUPING() Function
GROUPING() identifies which rows are subtotals created by ROLLUP:
-- Use GROUPING to identify summary rows
SELECT
region,
category,
SUM(total_amount) AS revenue,
GROUPING(region) AS is_region_total,
GROUPING(category) AS is_category_total,
CASE
WHEN GROUPING(region) = 1 AND GROUPING(category) = 1 THEN 'Grand Total'
WHEN GROUPING(category) = 1 THEN 'Region Subtotal'
ELSE 'Detail'
END AS row_type
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY region, category WITH ROLLUP;
GROUP_CONCAT() - String Aggregation
GROUP_CONCAT() combines multiple values into a comma-separated string:
-- List all products ordered by each customer
SELECT
c.customer_name,
COUNT(DISTINCT o.order_id) AS order_count,
GROUP_CONCAT(DISTINCT p.product_name ORDER BY p.product_name SEPARATOR ', ') AS products_ordered
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY c.customer_id, c.customer_name
ORDER BY order_count DESC;
-- Advanced GROUP_CONCAT with custom formatting
SELECT
category,
COUNT(*) AS product_count,
GROUP_CONCAT(
CONCAT(product_name, ' ($', ROUND(price, 2), ')')
ORDER BY price DESC
SEPARATOR ' | '
) AS products_with_prices
FROM products
GROUP BY category;
-- Output: "Premium Widget ($99.99) | Standard Widget ($49.99) | Basic Widget ($19.99)"
GROUP_CONCAT Limit: Default max length is 1024 characters. Increase with SET SESSION group_concat_max_len = 10000; if needed for large result sets.
HAVING vs WHERE
WHERE filters before grouping, HAVING filters after grouping:
-- Correct usage of WHERE and HAVING
SELECT
category,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
SUM(stock_quantity) AS total_stock
FROM products
WHERE is_active = 1 -- Filter rows BEFORE grouping
GROUP BY category
HAVING AVG(price) > 50 -- Filter groups AFTER aggregation
AND COUNT(*) >= 5
ORDER BY avg_price DESC;
Rule of Thumb: Use WHERE for filtering individual rows based on column values. Use HAVING for filtering groups based on aggregate results.
Advanced Aggregate Patterns
Conditional Aggregation
-- Pivot-style reporting with conditional aggregation
SELECT
category,
COUNT(*) AS total_products,
SUM(CASE WHEN price < 50 THEN 1 ELSE 0 END) AS budget_products,
SUM(CASE WHEN price BETWEEN 50 AND 100 THEN 1 ELSE 0 END) AS mid_range_products,
SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) AS premium_products,
ROUND(AVG(CASE WHEN price < 50 THEN price END), 2) AS avg_budget_price,
ROUND(AVG(CASE WHEN price > 100 THEN price END), 2) AS avg_premium_price
FROM products
GROUP BY category;
-- Sales metrics by time period
SELECT
category,
SUM(total_amount) AS total_sales,
SUM(CASE WHEN YEAR(order_date) = 2024 THEN total_amount ELSE 0 END) AS sales_2024,
SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END) AS sales_2023,
ROUND(
(SUM(CASE WHEN YEAR(order_date) = 2024 THEN total_amount ELSE 0 END) -
SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END)) * 100.0 /
NULLIF(SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END), 0),
2
) AS yoy_growth_pct
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY category
ORDER BY yoy_growth_pct DESC;
Combining Multiple Aggregates
-- Comprehensive customer analytics
SELECT
c.customer_id,
c.customer_name,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT oi.product_id) AS unique_products,
SUM(oi.quantity) AS total_items_purchased,
SUM(o.total_amount) AS lifetime_value,
ROUND(AVG(o.total_amount), 2) AS avg_order_value,
MIN(o.order_date) AS first_order_date,
MAX(o.order_date) AS last_order_date,
DATEDIFF(MAX(o.order_date), MIN(o.order_date)) AS customer_lifespan_days,
ROUND(
SUM(o.total_amount) / NULLIF(DATEDIFF(MAX(o.order_date), MIN(o.order_date)), 0) * 365,
2
) AS annualized_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY c.customer_id, c.customer_name
HAVING total_orders >= 3
ORDER BY lifetime_value DESC
LIMIT 100;
Statistical Aggregates
-- Advanced statistical analysis
SELECT
category,
COUNT(*) AS product_count,
ROUND(AVG(price), 2) AS mean_price,
ROUND(STDDEV(price), 2) AS std_deviation,
ROUND(VARIANCE(price), 2) AS variance,
MIN(price) AS min_price,
MAX(price) AS max_price,
MAX(price) - MIN(price) AS price_range
FROM products
GROUP BY category
ORDER BY mean_price DESC;
Nested Aggregates with Subqueries
-- Find categories with above-average product count
SELECT
category,
COUNT(*) AS product_count
FROM products
GROUP BY category
HAVING COUNT(*) > (
SELECT AVG(cat_count)
FROM (
SELECT COUNT(*) AS cat_count
FROM products
GROUP BY category
) AS category_counts
)
ORDER BY product_count DESC;
-- Customers in top 20% by spending
WITH customer_spending AS (
SELECT
customer_id,
SUM(total_amount) AS total_spent,
NTILE(5) OVER (ORDER BY SUM(total_amount) DESC) AS spending_quintile
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_name,
cs.total_spent,
cs.spending_quintile
FROM customer_spending cs
JOIN customers c ON cs.customer_id = c.customer_id
WHERE cs.spending_quintile = 1
ORDER BY cs.total_spent DESC;
Real-World Reporting Examples
-- Comprehensive sales dashboard
SELECT
DATE_FORMAT(o.order_date, '%Y-%m') AS month,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(oi.quantity) AS items_sold,
SUM(o.total_amount) AS revenue,
ROUND(AVG(o.total_amount), 2) AS avg_order_value,
ROUND(SUM(o.total_amount) / COUNT(DISTINCT o.customer_id), 2) AS revenue_per_customer,
COUNT(DISTINCT o.order_id) / COUNT(DISTINCT o.customer_id) AS orders_per_customer
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY DATE_FORMAT(o.order_date, '%Y-%m')
ORDER BY month DESC;
-- Product performance matrix
SELECT
p.category,
p.product_name,
COUNT(DISTINCT oi.order_id) AS times_ordered,
SUM(oi.quantity) AS units_sold,
ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total_revenue,
ROUND(AVG(oi.unit_price), 2) AS avg_selling_price,
MIN(o.order_date) AS first_sale,
MAX(o.order_date) AS last_sale,
DATEDIFF(CURDATE(), MAX(o.order_date)) AS days_since_last_sale
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
GROUP BY p.product_id, p.category, p.product_name
ORDER BY total_revenue DESC;
-- Cohort retention analysis
SELECT
DATE_FORMAT(first_order, '%Y-%m') AS cohort_month,
COUNT(*) AS cohort_size,
SUM(CASE WHEN months_active >= 1 THEN 1 ELSE 0 END) AS active_month_1,
SUM(CASE WHEN months_active >= 3 THEN 1 ELSE 0 END) AS active_month_3,
SUM(CASE WHEN months_active >= 6 THEN 1 ELSE 0 END) AS active_month_6,
ROUND(SUM(CASE WHEN months_active >= 3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS retention_3m_pct,
ROUND(SUM(CASE WHEN months_active >= 6 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS retention_6m_pct
FROM (
SELECT
customer_id,
MIN(order_date) AS first_order,
TIMESTAMPDIFF(MONTH, MIN(order_date), MAX(order_date)) AS months_active
FROM orders
GROUP BY customer_id
) AS customer_lifetime
GROUP BY DATE_FORMAT(first_order, '%Y-%m')
ORDER BY cohort_month DESC;
Performance Optimization Tips
Best Practices:
- Add indexes on GROUP BY columns for better performance
- Use WHERE to filter data before aggregation (reduces rows to process)
- Be cautious with GROUP_CONCAT on large result sets
- Consider summary tables for frequently-run aggregate queries
- Use EXPLAIN to analyze query performance
- Limit result sets with LIMIT when appropriate
-- Optimized aggregation with proper indexing
-- Index recommendations:
CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id);
CREATE INDEX idx_order_items_order_product ON order_items(order_id, product_id);
CREATE INDEX idx_products_category ON products(category);
-- Then run aggregation
SELECT
category,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue
FROM orders
JOIN order_items ON orders.order_id = order_items.order_id
JOIN products ON order_items.product_id = products.product_id
WHERE order_date >= '2024-01-01'
GROUP BY category;
Practice Exercise:
Challenge: Create comprehensive reports using aggregate functions:
- Monthly revenue report with year-over-year comparison
- Product category performance with subtotals using ROLLUP
- Customer segmentation by purchase frequency and value
Solutions:
-- 1. Monthly YoY comparison
SELECT
DATE_FORMAT(order_date, '%Y-%m') AS month,
SUM(total_amount) AS revenue,
LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m')) AS same_month_last_year,
ROUND((SUM(total_amount) - LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m'))) * 100.0 /
LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m')), 2) AS yoy_growth_pct
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
ORDER BY month DESC;
-- 2. Category performance with ROLLUP
SELECT
COALESCE(region, 'ALL REGIONS') AS region,
COALESCE(category, 'SUBTOTAL') AS category,
COUNT(*) AS orders,
SUM(total_amount) AS revenue,
ROUND(AVG(total_amount), 2) AS avg_order_value
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY region, category WITH ROLLUP;
-- 3. Customer segmentation
SELECT
CASE
WHEN order_count >= 10 THEN 'High Frequency'
WHEN order_count BETWEEN 5 AND 9 THEN 'Medium Frequency'
ELSE 'Low Frequency'
END AS frequency_segment,
CASE
WHEN total_spent >= 5000 THEN 'High Value'
WHEN total_spent BETWEEN 1000 AND 4999 THEN 'Medium Value'
ELSE 'Low Value'
END AS value_segment,
COUNT(*) AS customer_count,
ROUND(AVG(total_spent), 2) AS avg_lifetime_value
FROM (
SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
) AS customer_stats
GROUP BY frequency_segment, value_segment
ORDER BY customer_count DESC;
Summary
In this lesson, you mastered:
- Multi-column GROUP BY for multi-dimensional aggregations
- ROLLUP for hierarchical subtotals and grand totals
- GROUPING() function to identify summary rows
- GROUP_CONCAT() for string aggregation
- HAVING vs WHERE for proper filtering
- Conditional aggregation with CASE statements
- Complex real-world reporting queries
Congratulations! You've completed Module 2: Advanced Queries. You now have the skills to write sophisticated analytical queries, create comprehensive reports, and extract deep insights from your data. Next, you'll dive into Module 3: Indexes & Performance!