MySQL & Database Design
Time-Series & Analytics Database Design
Time-series data is essential for tracking metrics, logs, sensor data, and user analytics. In this lesson, we'll design databases optimized for high-volume time-stamped data, implement efficient partitioning strategies, and explore data warehousing concepts for analytics.
Time-Series Data Characteristics
Time-series databases have unique requirements:
Characteristics:
- High write volume (thousands of writes per second; see the batched-insert sketch below)
- Time-based queries (last hour, day, month)
- Mostly append-only (rarely update old data)
- Data grows continuously
- Aggregations over time periods
Common Use Cases:
- Application performance monitoring (APM)
- IoT sensor data
- Financial trading data
- User behavior analytics
- System metrics and logs
- Website traffic analytics
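Sustaining thousands of writes per second in MySQL usually means batching: one multi-row INSERT replaces hundreds of network round trips. A minimal sketch against the metrics table defined in the next section:
INSERT INTO metrics (metric_name, metric_value, tags) VALUES
('cpu_usage', 72.5, '{"host": "web-01", "region": "us-east"}'),
('cpu_usage', 68.1, '{"host": "web-02", "region": "us-east"}'),
('memory_usage', 41.3, '{"host": "web-01", "region": "us-east"}');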
Time-Series Schema Design
Optimized schema for time-series data:
-- Metrics data (optimized for time-series)
CREATE TABLE metrics (
id BIGINT UNSIGNED AUTO_INCREMENT,
metric_name VARCHAR(100) NOT NULL,
metric_value DOUBLE NOT NULL,
tags JSON COMMENT 'Additional dimensions: {host, region, etc}',
recorded_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, -- plain TIMESTAMP: UNIX_TIMESTAMP() partitioning needs an integer, which fractional TIMESTAMP(3) would break
PRIMARY KEY (recorded_at, id),
INDEX idx_id (id), -- InnoDB: the AUTO_INCREMENT column must lead an index
INDEX idx_metric_time (metric_name, recorded_at) -- no separate recorded_at index needed: the primary key already leads with it
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(recorded_at)) (
PARTITION p202401 VALUES LESS THAN (UNIX_TIMESTAMP('2024-02-01')),
PARTITION p202402 VALUES LESS THAN (UNIX_TIMESTAMP('2024-03-01')),
PARTITION p202403 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
PARTITION p202404 VALUES LESS THAN (UNIX_TIMESTAMP('2024-05-01')),
PARTITION p_future VALUES LESS THAN MAXVALUE
);
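Partition pruning is what makes this layout pay off: a query bounded on recorded_at touches only the matching partitions. EXPLAIN's partitions column confirms it (exact output varies by MySQL version):
EXPLAIN SELECT COUNT(*)
FROM metrics
WHERE recorded_at >= '2024-02-01'
AND recorded_at < '2024-03-01';
-- the partitions column should list only p202402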
-- Page views tracking
CREATE TABLE page_views (
id BIGINT UNSIGNED AUTO_INCREMENT,
user_id BIGINT UNSIGNED NULL,
session_id VARCHAR(100) NOT NULL,
page_url VARCHAR(500) NOT NULL,
referrer_url VARCHAR(500),
user_agent TEXT,
ip_address VARCHAR(45),
country_code CHAR(2),
device_type ENUM('desktop', 'mobile', 'tablet') NOT NULL,
viewed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, -- DATETIME, not TIMESTAMP: TO_DAYS() partitioning rejects timezone-dependent columns
PRIMARY KEY (viewed_at, id),
INDEX idx_id (id), -- InnoDB: the AUTO_INCREMENT column must lead an index
INDEX idx_user_time (user_id, viewed_at),
INDEX idx_session (session_id, viewed_at),
INDEX idx_url_time (page_url(100), viewed_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(viewed_at)) (
PARTITION p_week1 VALUES LESS THAN (TO_DAYS('2024-01-08')),
PARTITION p_week2 VALUES LESS THAN (TO_DAYS('2024-01-15')),
PARTITION p_week3 VALUES LESS THAN (TO_DAYS('2024-01-22')),
PARTITION p_week4 VALUES LESS THAN (TO_DAYS('2024-01-29')),
PARTITION p_future VALUES LESS THAN MAXVALUE
);
-- Application events/logs
CREATE TABLE app_events (
id BIGINT UNSIGNED AUTO_INCREMENT,
event_type VARCHAR(50) NOT NULL,
event_category VARCHAR(50) NOT NULL,
user_id BIGINT UNSIGNED NULL,
properties JSON,
severity ENUM('info', 'warning', 'error', 'critical') DEFAULT 'info',
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (created_at, id),
INDEX idx_id (id), -- InnoDB: the AUTO_INCREMENT column must lead an index
INDEX idx_type_time (event_type, created_at),
INDEX idx_user_time (user_id, created_at),
INDEX idx_severity (severity, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
PARTITION p_day1 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-02')),
PARTITION p_day2 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-03')),
PARTITION p_day3 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-04')),
PARTITION p_future VALUES LESS THAN MAXVALUE
);
-- System metrics (CPU, memory, disk)
CREATE TABLE system_metrics (
id BIGINT UNSIGNED AUTO_INCREMENT,
hostname VARCHAR(100) NOT NULL,
metric_type ENUM('cpu', 'memory', 'disk', 'network') NOT NULL,
metric_value DOUBLE NOT NULL,
unit VARCHAR(20) NOT NULL COMMENT '%, MB, GB, etc',
recorded_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (recorded_at, id),
INDEX idx_id (id), -- InnoDB: the AUTO_INCREMENT column must lead an index
INDEX idx_host_type (hostname, metric_type, recorded_at)
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(recorded_at)) (
PARTITION p_hour1 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01 01:00:00')),
PARTITION p_hour2 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01 02:00:00')),
PARTITION p_hour3 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01 03:00:00')),
PARTITION p_future VALUES LESS THAN MAXVALUE
);
Partitioning Strategy: Partition by time to enable fast queries on recent data and easy deletion of old data by dropping partitions.
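One maintenance chore this strategy implies: new partitions must exist before their period's data arrives. Splitting p_future with REORGANIZE PARTITION adds the next month without touching existing data:
-- Add May 2024 by splitting the (empty) p_future partition
ALTER TABLE metrics REORGANIZE PARTITION p_future INTO (
PARTITION p202405 VALUES LESS THAN (UNIX_TIMESTAMP('2024-06-01')),
PARTITION p_future VALUES LESS THAN MAXVALUE
);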
Time-Series Queries
Common patterns for querying time-series data:
-- Get metrics for last hour
SELECT
metric_name,
AVG(metric_value) AS avg_value,
MAX(metric_value) AS max_value,
MIN(metric_value) AS min_value,
COUNT(*) AS data_points
FROM metrics
WHERE recorded_at >= NOW() - INTERVAL 1 HOUR
GROUP BY metric_name;
-- Get page views per hour for last 24 hours
SELECT
DATE_FORMAT(viewed_at, '%Y-%m-%d %H:00:00') AS hour,
COUNT(*) AS views,
COUNT(DISTINCT session_id) AS unique_sessions,
COUNT(DISTINCT user_id) AS unique_users
FROM page_views
WHERE viewed_at >= NOW() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
-- Get top pages by views
SELECT
page_url,
COUNT(*) AS view_count,
COUNT(DISTINCT user_id) AS unique_visitors
FROM page_views
WHERE viewed_at >= NOW() - INTERVAL 7 DAY
GROUP BY page_url
ORDER BY view_count DESC
LIMIT 20;
-- Average over 5-minute buckets (a true moving average with window functions is sketched after these queries)
SELECT
FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(recorded_at) / 300) * 300) AS time_bucket,
metric_name,
AVG(metric_value) AS avg_value
FROM metrics
WHERE recorded_at >= NOW() - INTERVAL 1 HOUR
AND metric_name = 'cpu_usage'
GROUP BY time_bucket, metric_name
ORDER BY time_bucket;
-- Detect anomalies: values more than 2 standard deviations from the mean (CTEs require MySQL 8.0+)
WITH metric_stats AS (
SELECT
metric_name,
AVG(metric_value) AS mean_value,
STDDEV(metric_value) AS stddev_value
FROM metrics
WHERE recorded_at >= NOW() - INTERVAL 1 DAY
GROUP BY metric_name
)
SELECT
m.recorded_at,
m.metric_name,
m.metric_value,
ms.mean_value,
ms.stddev_value
FROM metrics m
JOIN metric_stats ms ON m.metric_name = ms.metric_name
WHERE m.recorded_at >= NOW() - INTERVAL 1 HOUR
AND ABS(m.metric_value - ms.mean_value) > (2 * ms.stddev_value)
ORDER BY m.recorded_at DESC;
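The bucketed averages above approximate a trend line; for a true moving average, MySQL 8.0 window functions average over the last N readings per metric:
-- Moving average over each metric's 10 most recent readings (MySQL 8.0+)
SELECT
recorded_at,
metric_name,
metric_value,
AVG(metric_value) OVER (
PARTITION BY metric_name
ORDER BY recorded_at
ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM metrics
WHERE recorded_at >= NOW() - INTERVAL 1 HOUR
ORDER BY metric_name, recorded_at;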
Data Aggregation Tables
Pre-aggregate data for faster queries:
-- Hourly aggregations
CREATE TABLE metrics_hourly (
metric_name VARCHAR(100) NOT NULL,
hour_start TIMESTAMP NOT NULL,
avg_value DOUBLE NOT NULL,
min_value DOUBLE NOT NULL,
max_value DOUBLE NOT NULL,
sum_value DOUBLE NOT NULL,
count_value BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (metric_name, hour_start),
INDEX idx_hour (hour_start)
) ENGINE=InnoDB;
-- Daily aggregations
CREATE TABLE metrics_daily (
metric_name VARCHAR(100) NOT NULL,
date DATE NOT NULL,
avg_value DOUBLE NOT NULL,
min_value DOUBLE NOT NULL,
max_value DOUBLE NOT NULL,
sum_value DOUBLE NOT NULL,
count_value BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (metric_name, date),
INDEX idx_date (date)
) ENGINE=InnoDB;
-- Populate hourly aggregations (run every hour)
INSERT INTO metrics_hourly
(metric_name, hour_start, avg_value, min_value, max_value, sum_value, count_value)
SELECT
metric_name,
DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00') AS hour_start,
AVG(metric_value) AS avg_value,
MIN(metric_value) AS min_value,
MAX(metric_value) AS max_value,
SUM(metric_value) AS sum_value,
COUNT(*) AS count_value
FROM metrics
WHERE recorded_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
AND recorded_at < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY metric_name, hour_start
ON DUPLICATE KEY UPDATE
avg_value = VALUES(avg_value),
min_value = VALUES(min_value),
max_value = VALUES(max_value),
sum_value = VALUES(sum_value),
count_value = VALUES(count_value);
-- Query aggregated data (much faster)
SELECT
metric_name,
AVG(avg_value) AS avg_value,
MAX(max_value) AS max_value,
MIN(min_value) AS min_value
FROM metrics_hourly
WHERE hour_start >= NOW() - INTERVAL 7 DAY
GROUP BY metric_name;
Performance Tip: Always query aggregated tables for historical data (older than 24 hours) and only query raw tables for real-time data.
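To keep the rollup current without an external cron job, MySQL's event scheduler can run it hourly. A sketch, assuming the scheduler is enabled (event_scheduler=ON); the event name is illustrative:
CREATE EVENT ev_rollup_metrics_hourly
ON SCHEDULE EVERY 1 HOUR
STARTS CURRENT_TIMESTAMP + INTERVAL 1 HOUR
DO
INSERT INTO metrics_hourly
(metric_name, hour_start, avg_value, min_value, max_value, sum_value, count_value)
SELECT
metric_name,
DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00') AS hour_start,
AVG(metric_value), MIN(metric_value), MAX(metric_value),
SUM(metric_value), COUNT(*)
FROM metrics
WHERE recorded_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
AND recorded_at < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY metric_name, hour_start
ON DUPLICATE KEY UPDATE
avg_value = VALUES(avg_value),
min_value = VALUES(min_value),
max_value = VALUES(max_value),
sum_value = VALUES(sum_value),
count_value = VALUES(count_value);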
Analytics Database (OLAP)
Data warehouse schema for business intelligence:
-- Fact table: Sales transactions
CREATE TABLE fact_sales (
sale_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
date_key INT UNSIGNED NOT NULL,
product_key INT UNSIGNED NOT NULL,
customer_key INT UNSIGNED NOT NULL,
store_key INT UNSIGNED NOT NULL,
quantity INT UNSIGNED NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
discount_amount DECIMAL(10, 2) DEFAULT 0,
total_amount DECIMAL(10, 2) NOT NULL,
profit_amount DECIMAL(10, 2) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_date (date_key),
INDEX idx_product (product_key),
INDEX idx_customer (customer_key),
INDEX idx_store (store_key)
) ENGINE=InnoDB;
-- Dimension table: Date
CREATE TABLE dim_date (
date_key INT UNSIGNED PRIMARY KEY,
full_date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL,
month INT NOT NULL,
week INT NOT NULL,
day_of_month INT NOT NULL,
day_of_week INT NOT NULL,
day_name VARCHAR(20) NOT NULL,
month_name VARCHAR(20) NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN DEFAULT FALSE,
UNIQUE KEY unique_date (full_date),
INDEX idx_year_month (year, month)
) ENGINE=InnoDB;
-- Dimension table: Product
CREATE TABLE dim_product (
product_key INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
product_id BIGINT UNSIGNED NOT NULL COMMENT 'ID from OLTP system',
product_name VARCHAR(255) NOT NULL,
category VARCHAR(100) NOT NULL,
subcategory VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(10, 2) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
INDEX idx_category (category),
INDEX idx_brand (brand)
) ENGINE=InnoDB;
-- Dimension table: Customer
CREATE TABLE dim_customer (
customer_key INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
customer_id BIGINT UNSIGNED NOT NULL COMMENT 'ID from OLTP system',
customer_name VARCHAR(200) NOT NULL,
customer_type ENUM('individual', 'business') NOT NULL,
country VARCHAR(100) NOT NULL,
region VARCHAR(100) NOT NULL,
city VARCHAR(100) NOT NULL,
customer_since DATE NOT NULL,
lifetime_value DECIMAL(12, 2) DEFAULT 0,
INDEX idx_country (country),
INDEX idx_region (region)
) ENGINE=InnoDB;
-- Dimension table: Store
CREATE TABLE dim_store (
store_key INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
store_id INT UNSIGNED NOT NULL COMMENT 'ID from OLTP system',
store_name VARCHAR(200) NOT NULL,
store_type ENUM('physical', 'online', 'warehouse') NOT NULL,
country VARCHAR(100) NOT NULL,
region VARCHAR(100) NOT NULL,
city VARCHAR(100) NOT NULL,
opened_date DATE NOT NULL,
INDEX idx_country (country),
INDEX idx_type (store_type)
) ENGINE=InnoDB;
Star Schema: Fact table at center connected to dimension tables. Optimized for analytical queries with denormalized dimensions.
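dim_date is normally populated once, ahead of time. A sketch using a recursive CTE (MySQL 8.0+) to generate one row per day of 2024:
INSERT INTO dim_date
(date_key, full_date, year, quarter, month, week,
day_of_month, day_of_week, day_name, month_name, is_weekend, is_holiday)
WITH RECURSIVE calendar (d) AS (
SELECT CAST('2024-01-01' AS DATE)
UNION ALL
SELECT d + INTERVAL 1 DAY FROM calendar WHERE d < '2024-12-31'
)
SELECT
CAST(DATE_FORMAT(d, '%Y%m%d') AS UNSIGNED), -- surrogate key, e.g. 20240101
d,
YEAR(d), QUARTER(d), MONTH(d), WEEK(d),
DAYOFMONTH(d), DAYOFWEEK(d),
DAYNAME(d), MONTHNAME(d),
DAYOFWEEK(d) IN (1, 7), -- DAYOFWEEK: 1 = Sunday, 7 = Saturday
FALSE -- holidays flagged separately, per business calendar
FROM calendar;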
Analytics Queries (OLAP)
Complex analytical queries:
-- Sales by month and category
SELECT
d.year,
d.month_name,
p.category,
SUM(f.total_amount) AS total_sales,
SUM(f.profit_amount) AS total_profit,
COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.month, d.month_name, p.category
ORDER BY d.month, total_sales DESC;
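GROUP BY ... WITH ROLLUP adds subtotal and grand-total rows to queries like the one above without a second pass:
-- Category subtotals plus a grand total (NULL marks the rollup rows)
SELECT
d.year,
p.category,
SUM(f.total_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, p.category WITH ROLLUP;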
-- Top products by profit
SELECT
p.product_name,
p.category,
SUM(f.quantity) AS units_sold,
SUM(f.total_amount) AS revenue,
SUM(f.profit_amount) AS profit,
ROUND(SUM(f.profit_amount) / NULLIF(SUM(f.total_amount), 0) * 100, 2) AS profit_margin
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
GROUP BY p.product_key, p.product_name, p.category
ORDER BY profit DESC
LIMIT 20;
-- Customer segmentation by value
-- (sales are pre-aggregated per customer so the join fan-out cannot skew the averages)
SELECT
CASE
WHEN c.lifetime_value >= 10000 THEN 'VIP'
WHEN c.lifetime_value >= 5000 THEN 'High Value'
WHEN c.lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_segment,
COUNT(*) AS customer_count,
AVG(c.lifetime_value) AS avg_lifetime_value,
SUM(COALESCE(s.total_sales, 0)) AS total_sales
FROM dim_customer c
LEFT JOIN (
SELECT customer_key, SUM(total_amount) AS total_sales
FROM fact_sales
WHERE date_key >= 20240101 -- 2024 sales only
GROUP BY customer_key
) s ON c.customer_key = s.customer_key
GROUP BY customer_segment
ORDER BY avg_lifetime_value DESC;
-- Year-over-year growth
SELECT
d.month_name,
SUM(CASE WHEN d.year = 2024 THEN f.total_amount ELSE 0 END) AS sales_2024,
SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END) AS sales_2023,
ROUND(
(SUM(CASE WHEN d.year = 2024 THEN f.total_amount ELSE 0 END) -
SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END)) /
NULLIF(SUM(CASE WHEN d.year = 2023 THEN f.total_amount ELSE 0 END), 0) * 100,
2) AS growth_percentage
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year IN (2023, 2024)
GROUP BY d.month, d.month_name
ORDER BY d.month;
-- Cohort analysis
SELECT
DATE_FORMAT(c.customer_since, '%Y-%m') AS cohort_month,
COUNT(DISTINCT c.customer_key) AS cohort_size,
SUM(f.total_amount) AS total_revenue,
SUM(f.total_amount) / COUNT(DISTINCT c.customer_key) AS avg_revenue_per_customer
FROM dim_customer c
LEFT JOIN fact_sales f ON c.customer_key = f.customer_key
WHERE c.customer_since >= '2024-01-01'
GROUP BY cohort_month
ORDER BY cohort_month;
Data Archiving Strategy
-- Option 1: drop an expired partition outright (instant, reclaims space)
ALTER TABLE metrics DROP PARTITION p202401;
-- Option 2: copy the partition into an archive table first, then drop it
CREATE TABLE metrics_archive LIKE metrics;
ALTER TABLE metrics_archive REMOVE PARTITIONING;
INSERT INTO metrics_archive
SELECT * FROM metrics PARTITION (p202401);
ALTER TABLE metrics DROP PARTITION p202401;
-- One-shot cleanup: delete up to 10,000 rows older than 90 days
DELETE FROM app_events
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 10000;
-- Batch delete to avoid locks
DELIMITER $$
CREATE PROCEDURE cleanup_old_events()
BEGIN
DECLARE rows_deleted INT DEFAULT 1;
WHILE rows_deleted > 0 DO
DELETE FROM app_events
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 1000;
SET rows_deleted = ROW_COUNT();
DO SLEEP(1);
END WHILE;
END$$
DELIMITER ;
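The procedure can then be scheduled to run nightly, again assuming event_scheduler=ON (the event name and 3 AM start time are illustrative):
CREATE EVENT ev_cleanup_old_events
ON SCHEDULE EVERY 1 DAY
STARTS TIMESTAMP(CURDATE() + INTERVAL 1 DAY, '03:00:00')
DO CALL cleanup_old_events();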
Performance Warning: Deleting millions of rows can lock tables. Use partitioning or batch deletes with sleep intervals.
ETL Process (OLTP to OLAP)
-- Extract, Transform, Load from OLTP to OLAP
-- Step 1: Extract new sales from OLTP
CREATE TEMPORARY TABLE temp_new_sales AS
SELECT
o.id AS order_id,
DATE(o.created_at) AS sale_date,
oi.product_id,
o.user_id AS customer_id,
1 AS store_id,
oi.quantity,
oi.unit_price,
0 AS discount_amount,
oi.total_price AS total_amount,
(oi.total_price - (oi.quantity * p.cost_price)) AS profit_amount
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= CURDATE() - INTERVAL 1 DAY
AND o.status = 'completed';
-- Step 2: Load into fact table
INSERT INTO fact_sales
(date_key, product_key, customer_key, store_key,
quantity, unit_price, discount_amount, total_amount, profit_amount)
SELECT
CAST(DATE_FORMAT(t.sale_date, '%Y%m%d') AS UNSIGNED) AS date_key,
dp.product_key,
dc.customer_key,
ds.store_key,
t.quantity,
t.unit_price,
t.discount_amount,
t.total_amount,
t.profit_amount
FROM temp_new_sales t
JOIN dim_product dp ON t.product_id = dp.product_id
JOIN dim_customer dc ON t.customer_id = dc.customer_id
JOIN dim_store ds ON t.store_id = ds.store_id;
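One caveat: re-running this job within the same day would load the same orders twice. A common guard (the etl_watermark table and job name here are illustrative, not part of the earlier schema) is a high-water mark advanced alongside the load:
-- Hypothetical watermark table: remembers how far the last run got
CREATE TABLE etl_watermark (
job_name VARCHAR(100) PRIMARY KEY,
last_loaded_at DATETIME NOT NULL
);
-- In Step 1, extract strictly after the stored mark instead of a fixed window:
-- WHERE o.created_at > (SELECT last_loaded_at FROM etl_watermark WHERE job_name = 'fact_sales_load')
-- After Step 2 succeeds, advance the mark in the same transaction as the fact load:
UPDATE etl_watermark
SET last_loaded_at = NOW()
WHERE job_name = 'fact_sales_load';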
Practice Exercise:
Task: Create a user session analytics system.
Requirements:
- Track user sessions with start/end times
- Record pages viewed per session
- Calculate session duration
- Identify bounce rate (single page sessions)
Solution:
CREATE TABLE user_sessions (
id BIGINT UNSIGNED AUTO_INCREMENT,
session_id VARCHAR(100) UNIQUE NOT NULL,
user_id BIGINT UNSIGNED NULL,
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP NULL,
page_count INT UNSIGNED DEFAULT 0,
duration_seconds INT UNSIGNED DEFAULT 0,
device_type ENUM('desktop', 'mobile', 'tablet') NOT NULL,
PRIMARY KEY (started_at, id),
INDEX idx_id (id), -- InnoDB: the AUTO_INCREMENT column must lead an index; session_id already has a unique index
INDEX idx_user (user_id, started_at)
) ENGINE=InnoDB;
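One way to maintain these rows is a single upsert per page view, keyed on the UNIQUE session_id (the ? placeholders stand for application-supplied values):
INSERT INTO user_sessions
(session_id, user_id, started_at, page_count, duration_seconds, device_type)
VALUES (?, ?, NOW(), 1, 0, ?)
ON DUPLICATE KEY UPDATE
ended_at = NOW(),
page_count = page_count + 1,
duration_seconds = TIMESTAMPDIFF(SECOND, started_at, NOW());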
-- Calculate bounce rate
SELECT
DATE(started_at) AS date,
COUNT(*) AS total_sessions,
SUM(CASE WHEN page_count = 1 THEN 1 ELSE 0 END) AS bounce_sessions,
ROUND(SUM(CASE WHEN page_count = 1 THEN 1 ELSE 0 END) / COUNT(*) * 100, 2) AS bounce_rate
FROM user_sessions
WHERE started_at >= CURDATE() - INTERVAL 7 DAY
GROUP BY date
ORDER BY date;
-- Average session duration by device
SELECT
device_type,
COUNT(*) AS sessions,
ROUND(AVG(duration_seconds), 0) AS avg_duration_seconds,
ROUND(AVG(page_count), 1) AS avg_pages_per_session
FROM user_sessions
WHERE started_at >= CURDATE() - INTERVAL 30 DAY
AND duration_seconds > 0
GROUP BY device_type;
Summary
In this lesson, you learned:
- Designing databases for time-series data
- Implementing table partitioning for performance
- Creating pre-aggregated summary tables
- Building star schema for data warehousing
- Writing analytical queries for business intelligence
- ETL processes from OLTP to OLAP
- Data archiving and cleanup strategies
Next Up: In the final lesson, we'll review all database design patterns and best practices covered throughout this tutorial!