MySQL & Database Design

Aggregate Functions & GROUP BY Mastery

13 min Lesson 13 of 40

Aggregate Functions & GROUP BY Mastery

Aggregate functions combined with GROUP BY are the foundation of data analysis and reporting in SQL. They allow you to summarize, analyze, and derive insights from large datasets. Mastering these techniques will enable you to create powerful analytics queries and comprehensive reports.

Core Aggregate Functions Review

COUNT(): Count rows or non-NULL values SUM(): Total of numeric values AVG(): Average of numeric values MIN(): Minimum value MAX(): Maximum value GROUP_CONCAT(): Concatenate values into string (MySQL-specific)

GROUP BY with Multiple Columns

Group by multiple columns to create multi-dimensional aggregations:

-- Sales by region and category SELECT region, category, COUNT(*) AS order_count, SUM(total_amount) AS total_revenue, AVG(total_amount) AS avg_order_value, MIN(order_date) AS first_order, MAX(order_date) AS last_order FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY region, category ORDER BY region, total_revenue DESC;
-- Monthly sales by product category SELECT YEAR(order_date) AS year, MONTH(order_date) AS month, category, COUNT(DISTINCT order_id) AS orders, SUM(quantity * unit_price) AS revenue, COUNT(DISTINCT customer_id) AS unique_customers FROM orders JOIN order_items ON orders.order_id = order_items.order_id JOIN products ON order_items.product_id = products.product_id GROUP BY YEAR(order_date), MONTH(order_date), category ORDER BY year DESC, month DESC, revenue DESC;

ROLLUP - Creating Subtotals and Grand Totals

ROLLUP generates hierarchical subtotals and grand totals:

-- Sales with subtotals by region and category SELECT region, category, SUM(total_amount) AS total_sales, COUNT(*) AS order_count FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY region, category WITH ROLLUP; -- Result includes: -- 1. Individual region + category combinations -- 2. Subtotals for each region (category = NULL) -- 3. Grand total (region = NULL, category = NULL)
-- Multi-level rollup with COALESCE for readability SELECT COALESCE(region, 'TOTAL') AS region, COALESCE(category, 'All Categories') AS category, COUNT(*) AS orders, SUM(total_amount) AS revenue, ROUND(AVG(total_amount), 2) AS avg_order_value FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY region, category WITH ROLLUP;
Understanding ROLLUP: With GROUP BY col1, col2 WITH ROLLUP, MySQL creates aggregations for (col1, col2), (col1, NULL), and (NULL, NULL). This gives you subtotals at each hierarchy level.

GROUPING() Function

GROUPING() identifies which rows are subtotals created by ROLLUP:

-- Use GROUPING to identify summary rows SELECT region, category, SUM(total_amount) AS revenue, GROUPING(region) AS is_region_total, GROUPING(category) AS is_category_total, CASE WHEN GROUPING(region) = 1 AND GROUPING(category) = 1 THEN 'Grand Total' WHEN GROUPING(category) = 1 THEN 'Region Subtotal' ELSE 'Detail' END AS row_type FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY region, category WITH ROLLUP;

GROUP_CONCAT() - String Aggregation

GROUP_CONCAT() combines multiple values into a comma-separated string:

-- List all products ordered by each customer SELECT c.customer_name, COUNT(DISTINCT o.order_id) AS order_count, GROUP_CONCAT(DISTINCT p.product_name ORDER BY p.product_name SEPARATOR ', ') AS products_ordered FROM customers c JOIN orders o ON c.customer_id = o.customer_id JOIN order_items oi ON o.order_id = oi.order_id JOIN products p ON oi.product_id = p.product_id GROUP BY c.customer_id, c.customer_name ORDER BY order_count DESC;
-- Advanced GROUP_CONCAT with custom formatting SELECT category, COUNT(*) AS product_count, GROUP_CONCAT( CONCAT(product_name, ' ($', ROUND(price, 2), ')') ORDER BY price DESC SEPARATOR ' | ' ) AS products_with_prices FROM products GROUP BY category; -- Output: "Premium Widget ($99.99) | Standard Widget ($49.99) | Basic Widget ($19.99)"
GROUP_CONCAT Limit: Default max length is 1024 characters. Increase with SET SESSION group_concat_max_len = 10000; if needed for large result sets.

HAVING vs WHERE

WHERE filters before grouping, HAVING filters after grouping:

-- Correct usage of WHERE and HAVING SELECT category, COUNT(*) AS product_count, AVG(price) AS avg_price, SUM(stock_quantity) AS total_stock FROM products WHERE is_active = 1 -- Filter rows BEFORE grouping GROUP BY category HAVING AVG(price) > 50 -- Filter groups AFTER aggregation AND COUNT(*) >= 5 ORDER BY avg_price DESC;
Rule of Thumb: Use WHERE for filtering individual rows based on column values. Use HAVING for filtering groups based on aggregate results.

Advanced Aggregate Patterns

Conditional Aggregation

-- Pivot-style reporting with conditional aggregation SELECT category, COUNT(*) AS total_products, SUM(CASE WHEN price < 50 THEN 1 ELSE 0 END) AS budget_products, SUM(CASE WHEN price BETWEEN 50 AND 100 THEN 1 ELSE 0 END) AS mid_range_products, SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) AS premium_products, ROUND(AVG(CASE WHEN price < 50 THEN price END), 2) AS avg_budget_price, ROUND(AVG(CASE WHEN price > 100 THEN price END), 2) AS avg_premium_price FROM products GROUP BY category;
-- Sales metrics by time period SELECT category, SUM(total_amount) AS total_sales, SUM(CASE WHEN YEAR(order_date) = 2024 THEN total_amount ELSE 0 END) AS sales_2024, SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END) AS sales_2023, ROUND( (SUM(CASE WHEN YEAR(order_date) = 2024 THEN total_amount ELSE 0 END) - SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END)) * 100.0 / NULLIF(SUM(CASE WHEN YEAR(order_date) = 2023 THEN total_amount ELSE 0 END), 0), 2 ) AS yoy_growth_pct FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY category ORDER BY yoy_growth_pct DESC;

Combining Multiple Aggregates

-- Comprehensive customer analytics SELECT c.customer_id, c.customer_name, COUNT(DISTINCT o.order_id) AS total_orders, COUNT(DISTINCT oi.product_id) AS unique_products, SUM(oi.quantity) AS total_items_purchased, SUM(o.total_amount) AS lifetime_value, ROUND(AVG(o.total_amount), 2) AS avg_order_value, MIN(o.order_date) AS first_order_date, MAX(o.order_date) AS last_order_date, DATEDIFF(MAX(o.order_date), MIN(o.order_date)) AS customer_lifespan_days, ROUND( SUM(o.total_amount) / NULLIF(DATEDIFF(MAX(o.order_date), MIN(o.order_date)), 0) * 365, 2 ) AS annualized_value FROM customers c JOIN orders o ON c.customer_id = o.customer_id JOIN order_items oi ON o.order_id = oi.order_id GROUP BY c.customer_id, c.customer_name HAVING total_orders >= 3 ORDER BY lifetime_value DESC LIMIT 100;

Statistical Aggregates

-- Advanced statistical analysis SELECT category, COUNT(*) AS product_count, ROUND(AVG(price), 2) AS mean_price, ROUND(STDDEV(price), 2) AS std_deviation, ROUND(VARIANCE(price), 2) AS variance, MIN(price) AS min_price, MAX(price) AS max_price, MAX(price) - MIN(price) AS price_range FROM products GROUP BY category ORDER BY mean_price DESC;

Nested Aggregates with Subqueries

-- Find categories with above-average product count SELECT category, COUNT(*) AS product_count FROM products GROUP BY category HAVING COUNT(*) > ( SELECT AVG(cat_count) FROM ( SELECT COUNT(*) AS cat_count FROM products GROUP BY category ) AS category_counts ) ORDER BY product_count DESC;
-- Customers in top 20% by spending WITH customer_spending AS ( SELECT customer_id, SUM(total_amount) AS total_spent, NTILE(5) OVER (ORDER BY SUM(total_amount) DESC) AS spending_quintile FROM orders GROUP BY customer_id ) SELECT c.customer_name, cs.total_spent, cs.spending_quintile FROM customer_spending cs JOIN customers c ON cs.customer_id = c.customer_id WHERE cs.spending_quintile = 1 ORDER BY cs.total_spent DESC;

Real-World Reporting Examples

-- Comprehensive sales dashboard SELECT DATE_FORMAT(o.order_date, '%Y-%m') AS month, COUNT(DISTINCT o.order_id) AS total_orders, COUNT(DISTINCT o.customer_id) AS unique_customers, SUM(oi.quantity) AS items_sold, SUM(o.total_amount) AS revenue, ROUND(AVG(o.total_amount), 2) AS avg_order_value, ROUND(SUM(o.total_amount) / COUNT(DISTINCT o.customer_id), 2) AS revenue_per_customer, COUNT(DISTINCT o.order_id) / COUNT(DISTINCT o.customer_id) AS orders_per_customer FROM orders o JOIN order_items oi ON o.order_id = oi.order_id WHERE o.order_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH) GROUP BY DATE_FORMAT(o.order_date, '%Y-%m') ORDER BY month DESC;
-- Product performance matrix SELECT p.category, p.product_name, COUNT(DISTINCT oi.order_id) AS times_ordered, SUM(oi.quantity) AS units_sold, ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total_revenue, ROUND(AVG(oi.unit_price), 2) AS avg_selling_price, MIN(o.order_date) AS first_sale, MAX(o.order_date) AS last_sale, DATEDIFF(CURDATE(), MAX(o.order_date)) AS days_since_last_sale FROM products p LEFT JOIN order_items oi ON p.product_id = oi.product_id LEFT JOIN orders o ON oi.order_id = o.order_id GROUP BY p.product_id, p.category, p.product_name ORDER BY total_revenue DESC;
-- Cohort retention analysis SELECT DATE_FORMAT(first_order, '%Y-%m') AS cohort_month, COUNT(*) AS cohort_size, SUM(CASE WHEN months_active >= 1 THEN 1 ELSE 0 END) AS active_month_1, SUM(CASE WHEN months_active >= 3 THEN 1 ELSE 0 END) AS active_month_3, SUM(CASE WHEN months_active >= 6 THEN 1 ELSE 0 END) AS active_month_6, ROUND(SUM(CASE WHEN months_active >= 3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS retention_3m_pct, ROUND(SUM(CASE WHEN months_active >= 6 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS retention_6m_pct FROM ( SELECT customer_id, MIN(order_date) AS first_order, TIMESTAMPDIFF(MONTH, MIN(order_date), MAX(order_date)) AS months_active FROM orders GROUP BY customer_id ) AS customer_lifetime GROUP BY DATE_FORMAT(first_order, '%Y-%m') ORDER BY cohort_month DESC;

Performance Optimization Tips

Best Practices:
  • Add indexes on GROUP BY columns for better performance
  • Use WHERE to filter data before aggregation (reduces rows to process)
  • Be cautious with GROUP_CONCAT on large result sets
  • Consider summary tables for frequently-run aggregate queries
  • Use EXPLAIN to analyze query performance
  • Limit result sets with LIMIT when appropriate
-- Optimized aggregation with proper indexing -- Index recommendations: CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id); CREATE INDEX idx_order_items_order_product ON order_items(order_id, product_id); CREATE INDEX idx_products_category ON products(category); -- Then run aggregation SELECT category, COUNT(*) AS order_count, SUM(total_amount) AS revenue FROM orders JOIN order_items ON orders.order_id = order_items.order_id JOIN products ON order_items.product_id = products.product_id WHERE order_date >= '2024-01-01' GROUP BY category;

Practice Exercise:

Challenge: Create comprehensive reports using aggregate functions:

  1. Monthly revenue report with year-over-year comparison
  2. Product category performance with subtotals using ROLLUP
  3. Customer segmentation by purchase frequency and value

Solutions:

-- 1. Monthly YoY comparison SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, SUM(total_amount) AS revenue, LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m')) AS same_month_last_year, ROUND((SUM(total_amount) - LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m'))) * 100.0 / LAG(SUM(total_amount), 12) OVER (ORDER BY DATE_FORMAT(order_date, '%Y-%m')), 2) AS yoy_growth_pct FROM orders GROUP BY DATE_FORMAT(order_date, '%Y-%m') ORDER BY month DESC; -- 2. Category performance with ROLLUP SELECT COALESCE(region, 'ALL REGIONS') AS region, COALESCE(category, 'SUBTOTAL') AS category, COUNT(*) AS orders, SUM(total_amount) AS revenue, ROUND(AVG(total_amount), 2) AS avg_order_value FROM orders JOIN products ON orders.product_id = products.product_id GROUP BY region, category WITH ROLLUP; -- 3. Customer segmentation SELECT CASE WHEN order_count >= 10 THEN 'High Frequency' WHEN order_count BETWEEN 5 AND 9 THEN 'Medium Frequency' ELSE 'Low Frequency' END AS frequency_segment, CASE WHEN total_spent >= 5000 THEN 'High Value' WHEN total_spent BETWEEN 1000 AND 4999 THEN 'Medium Value' ELSE 'Low Value' END AS value_segment, COUNT(*) AS customer_count, ROUND(AVG(total_spent), 2) AS avg_lifetime_value FROM ( SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_spent FROM orders GROUP BY customer_id ) AS customer_stats GROUP BY frequency_segment, value_segment ORDER BY customer_count DESC;

Summary

In this lesson, you mastered:

  • Multi-column GROUP BY for multi-dimensional aggregations
  • ROLLUP for hierarchical subtotals and grand totals
  • GROUPING() function to identify summary rows
  • GROUP_CONCAT() for string aggregation
  • HAVING vs WHERE for proper filtering
  • Conditional aggregation with CASE statements
  • Complex real-world reporting queries
Congratulations! You've completed Module 2: Advanced Queries. You now have the skills to write sophisticated analytical queries, create comprehensive reports, and extract deep insights from your data. Next, you'll dive into Module 3: Indexes & Performance!

ES
Edrees Salih
23 hours ago

We are still cooking the magic in the way!