SQL for Data Engineering¶

SQL is the universal language for data engineering. This entry covers DE-specific SQL patterns including window functions, CTEs, query optimization, and DWH-oriented techniques.

Query Execution Order¶

FROM -> JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT

WHERE filters rows BEFORE aggregation (no aggregate functions)
HAVING filters groups AFTER aggregation (can use aggregates)

Window Functions¶

Compute across rows without collapsing them (unlike GROUP BY).

function() OVER (
    PARTITION BY col    -- define groups
    ORDER BY col        -- ordering within group
    ROWS BETWEEN ...    -- frame specification
)

Ranking¶

ROW_NUMBER() OVER (ORDER BY salary DESC)  -- unique: 1,2,3,4
RANK()       OVER (ORDER BY salary DESC)  -- ties, gaps: 1,1,3,4
DENSE_RANK() OVER (ORDER BY salary DESC)  -- ties, no gaps: 1,1,2,3
NTILE(4)     OVER (ORDER BY salary)       -- divide into buckets

Offset¶

LAG(col, 1)    OVER (ORDER BY date)  -- previous row
LEAD(col, 1)   OVER (ORDER BY date)  -- next row
FIRST_VALUE(col) OVER (ORDER BY date)
LAST_VALUE(col)  OVER (ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Aggregates Over Windows¶

SUM(revenue) OVER (ORDER BY date) AS cumulative      -- running total
AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
revenue * 100.0 / SUM(revenue) OVER () AS pct_total  -- percent of total

Frame: ROWS vs RANGE¶

ROWS: physical row positions relative to current
RANGE: value differences in ORDER BY column
Default with ORDER BY: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Common Patterns¶

Top N Per Group¶

WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY category ORDER BY revenue DESC
    ) AS rn
    FROM sales
)
SELECT * FROM ranked WHERE rn <= 3;

Deduplication¶

WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY email ORDER BY created_at DESC
    ) AS rn FROM users
)
SELECT * FROM ranked WHERE rn = 1;

Year-over-Year Change¶

SELECT month, revenue,
    LAG(revenue, 12) OVER (ORDER BY month) AS last_year,
    (revenue - LAG(revenue, 12) OVER (ORDER BY month)) * 100.0
        / LAG(revenue, 12) OVER (ORDER BY month) AS yoy_pct
FROM monthly_stats;

Find Missing Dates¶

WITH dates AS (
    SELECT generate_series(MIN(date), MAX(date), '1 day')::date AS d
    FROM orders
)
SELECT d FROM dates LEFT JOIN orders ON orders.date = d
WHERE orders.date IS NULL;

JOIN Types¶

Type	What
INNER	Only matching rows
LEFT	All from left + matching right
RIGHT	All from right + matching left
FULL	All from both, NULLs where no match
CROSS	Cartesian product
ANTI	`LEFT JOIN ... WHERE b.key IS NULL`
SEMI	`WHERE EXISTS (SELECT 1 FROM b WHERE ...)`

Cartesian Product Warning¶

When join key has duplicates in BOTH tables, rows multiply: 2 left x 2 right = 4 rows. GROUP BY before JOIN to deduplicate.

CTEs and Recursive Queries¶

WITH RECURSIVE hierarchy AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, h.level + 1
    FROM employees e JOIN hierarchy h ON e.manager_id = h.id
)
SELECT * FROM hierarchy;

NULL Handling¶

COALESCE(x, y, ...)      -- first non-NULL
NULLIF(x, y)             -- NULL if x=y (prevents division by zero)
SELECT total / NULLIF(count, 0) FROM stats;

COUNT(column) excludes NULLs; COUNT(*) counts all
NULL = NULL is NULL, not TRUE. Use IS NULL
Multiple NULLs = one group in DISTINCT/GROUP BY

EXPLAIN¶

EXPLAIN ANALYZE SELECT ...;      -- actual timings
EXPLAIN (BUFFERS, ANALYZE) ...;  -- buffer usage

Tool: explain.tensor.ru for visual plan analysis.

Gotchas¶

LAST_VALUE() with default frame only sees up to current row - must set UNBOUNDED FOLLOWING
Cartesian product silently inflates aggregates after JOIN
COUNT(DISTINCT col) not DISTINCT COUNT(col)
Subqueries with LIMIT 1 return one row or empty - safe for scalar
GROUP BY column order does not matter; ORDER BY column order does