(re)Learning SQL – Window functions

Some time last year I felt that I needed to brush up on SQL. I took a few evenings to skim some courses and online resources. (Stackoverflow, Vertabello Academy, db-fiddle.com)

Below are some of my thoughts and notes from going over the material.

* Window functions
* Rows and range
* Partition by

Window Functions 

I have been using window functions for calculating simple cross table totals and rolling totals.
This is pretty simple and can be achieved just with
SUM(column) OVER()
for getting the total for the whole table.

We can use something like:
SUM(colum) OVER(ORDER BY date_column ROWS BETWEEN UNBOUNDED PREVIOUS AND CURRENT ROW) for the row by row rolling total. Lets get into the last one a bit more.

CREATE TABLE orders (
placed_dt DATE,
total_price NUMERIC
);
INSERT INTO orders(placed_dt, total_price)
VALUES
('2020-01-01',1.23), 
('2020-01-02',2.50), 
('2020-01-03',5.23), 
('2020-01-04',1.50), 
('2020-01-04',9.53), 
('2020-01-06',8.90), 
('2020-01-07',0.23), 
('2020-01-08',5.50), 
('2020-01-09',1.23), 
('2020-01-10',2.60);

SELECT
placed_dt,
SUM(total_price) OVER(
  ORDER BY placed_dt
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
from orders


We are calculating a rolling sum of total_price. For each row the sum of current row and all preceding rows. Here OVERdefines the start of the window calculation. That means the sub selection (window) of rows we want the SUM calculation to work on. ORDER BY defines the order of the rows in the window calculation. ROWS can use the following arguments to define the size of the window for the calculation.

UNBOUNDED PRECEDING
CURRENT ROW
UNBOUNDED FOLLOWING
BETWEEN [number] PRECEDING AND [number] FOLLOWING

These can be abbreviated by ommiting either the FOLLOWING or PRECEDING if all we want is to calculate in one direction (until the start or end of the table) in relation to the current row

  • ROWS UNBOUNDED PRECEDING means BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  • ROWS [number] PRECEDING means BETWEEN [number] PRECEDING AND CURRENT ROW
  • ROWS CURRENT ROW means BETWEEN CURRENT ROW AND CURRENT ROW

Window functions nested in calculated fields

Window functions can also be used inside calculated fields. Below an example of calculating the ratio of a purchase to the average purchase price in a 3 day window:

SELECT
      placed_dt,
      total_price,
      total_price / (AVG(total_price) OVER (
           ORDER BY placed_dt 
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) as ratio
FROM orders

*remember that ROW constraints (i.e. what you write after ORDER BY) must be in logical order, from preceding to following row constraint. So we need to write:

ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW

NOT

ORDER BY year ROWS BETWEEN CURRENT ROW AND 2 PRECEDING

Rows and Range

We could also use RANGE instead of ROWS in the window function.

The difference between ROWS and RANGE is that while ROWS will apply the window function to all the rows specified, RANGE will limit the window function to any rows that have the same value in the column which we order by.

ROWS applies the window function across any rows that are specified

SELECT
placed_dt,
total_price,
SUM(total_price) OVER (
  ORDER BY placed_dt
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as window_rolling_sum
FROM orders


Output:

| placed_dt  | total_price | window_rolling_sum |
| ---------- | ----------- | ------------------ |
| 01-01-2020 | 1.23        | 1.23               |
| 02-01-2020 | 2.50        | 3.73               |
| 03-01-2020 | 5.23        | 8.96               |
| 04-01-2020 | 1.50        | 10.46              |<- grouped row by row
| 04-01-2020 | 9.53        | 19.99              |<- grouped row by row


RANGE groups the window function by the ORDER BY columns

SELECT
placed_dt,
total_price,
SUM(total_price) OVER (
  ORDER BY placed_dt
  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as window_rolling_sum
FROM orders


Output:

| placed_dt  | total_price | window_rolling_sum |
| ---------- | ----------- | ------------------ |
| 01-01-2020 | 1.23        | 1.23               |
| 02-01-2020 | 2.50        | 3.73               |
| 03-01-2020 | 5.23        | 8.96               |
| 04-01-2020 | 1.50        | 19.99              |<- grouped by placed_dt
| 04-01-2020 | 9.53        | 19.99              |<- grouped by placed_dt


Using Partition By

We can do similar grouping by forgetting about ORDER BY and using PARTITION BY

SELECT
placed_dt,
total_price,
SUM(total_price) OVER(
  PARTITION BY placed_dt)
FROM orders


Output:

| placed_dt  | total_price | window_sum |
| ---------- | ----------- | ---------- |
| 01-01-2020 | 1.23        | 1.23       |
| 01-02-2020 | 2.50        | 2.50       |
| 01-03-2020 | 5.23        | 5.23       |
| 01-04-2020 | 1.50        | 11.03      |<- grouped by placed_dt
| 01-04-2020 | 9.53        | 11.03      |<- grouped by placed_dt


Order of operations

This is a good spot to refresh general rules for SQL order of operations. That is in what order the database processes your query. Starting from the first process:

  1. FROM – the database gets the data from tables in FROM clause and if necessary performs the JOINs,
  2. WHERE – the data are filtered with conditions specified in WHERE clause,
  3. GROUP BY – the data are grouped by with conditions specified in WHERE clause,
  4. aggregate functions — the aggregate functions are applied to the groups created in the GROUP BY phase,
  5. HAVING – the groups are filtered with the given condition,
  6. window functions,
  7. SELECT – the database selects the given columns,
  8. DISTINCT – repeated values are removed,
  9. UNION/INTERSECT/EXCEPT – the database applies set operations,
  10. ORDER BY – the results are sorted,
  11. OFFSET – the first rows are skipped,
  12. LIMIT/FETCH/TOP – only the first rows are selected

As we can see window functions get executed after WHERE, GROUP BY or HAVING. Practically, this order means that you can’t put window functions anywhere in the FROM, WHERE, GROUP BY or HAVING clauses.

To recap

  • You can define a window frame within OVER(...). The syntax is: [ROWS|RANGE] <window frame definition>.
  • <window frame definition> is defined with BETWEEN <lower bound> AND <upper bound>, where the bounds may be defined with:
  • UNBOUNDED PRECEDING,
  • n PRECEDING (ROWS only),
  • CURRENT ROW,
  • n FOLLOWING (ROWS only),
  • UNBOUNDED FOLLOWING
  • ROWS always treats rows individually (like the ROW_NUMBER() function), RANGE also considers rows which share the same value in the column we order by.
  • Window functions can only appear in the SELECT and ORDER BY clauses.
  • If you need window functions in other parts of the query, use a subquery.

Leave a comment

Design a site like this with WordPress.com
Get started