Search code examples
sqlmysqlwindow-functions

How does unbounded preceding and current row work exactly?


I have the following data:

CREATE SCHEMA test_schema;
CREATE TABLE test_schema.joinedDf(id LONG, firstName TEXT, lastName TEXT, age INT, month INT, salary DECIMAL);

INSERT INTO test_schema.joinedDf(id, firstName, lastName, age, month, salary) VALUES
(1111, 'Peter', 'Ivanov', 29, 10, 300000),
(1111, 'Peter', 'Ivanov', 29, 12, 350000),
(1111, 'Peter', 'Ivanov', 29, 11, 350000);

When I do this:

SELECT id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) average_salary
FROM test_schema.joinedDf;

I get this data:

1111, Peter, Ivanov, 29, 10, 300000, 333333.3333
1111, Peter, Ivanov, 29, 12, 350000, 333333.3333
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333

But when I do:

SELECT id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id ORDER BY salary) average_salary
FROM test_schema.joinedDf;

I get:

1111, Peter, Ivanov, 29, 10, 300000, 300000.0000
1111, Peter, Ivanov, 29, 12, 350000, 333333.3333
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333

I read that by default, if you specify order by inside the partition by clause. You get the window frame of unbounded preceding and current row. But why doesn't the data look like this?:

1111, Peter, Ivanov, 29, 10, 300000, 300000.0000
1111, Peter, Ivanov, 29, 12, 350000, 325000.0000
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333

So, we first have Partition(300000), the average is 300000, then we have Partition(300000, 350000), the average is 325000, then we have Partition(300000, 350000, 350000) the average is 333333.3333. No?


Solution

  • The problem is that the default window frame (if you specify order by) is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (more about it here, it's for spark, but mysql works similarly: What's the default window frame for window functions), while, if we write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW it works fine.

    So, it's a ROWS vs. RANGE issue.

    From https://www.mysqltutorial.org/mysql-window-functions/ :

    The frame unit specifies the type of relationship between the current row and frame rows. It can be ROWS or RANGE. The offsets of the current row and frame rows are the row numbers if the frame unit is ROWS and row values the frame unit is RANGE.

    So, since we in our RANGE we have duplicates, it processes them as "one unit", instead of separately.

    For example, let's look at this data:

    enter image description here

    if we run the same query:

    select id, firstName, lastName, age, month, salary,
    AVG(salary) OVER (PARTITION BY id order by salary) average_salary
    from test_schema.joinedDf;
    

    we get:

    enter image description here

    Because the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and it will first process this data:

    1. Frame(300,000) avg 300000
    2. Frame(300,000, 350,000, 350,000) avg 333333
    3. Frame(300,000, 350,000, 350,000) avg 333333
    4. Frame(300,000, 350,000, 350,000, 400,000, 400,000) avg 360,000
    5. Frame(300,000, 350,000, 350,000, 400,000, 400,000) avg 360,000

    While ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW would process this:

    1. Frame(300,000), avg 300,000
    2. Frame(300,000, 350,000), avg 325,000
    3. Frame(300,000, 350,000, 350,000), avg 333,333
    4. Frame(300,000, 350,000, 350,000, 400,000) avg 350,000
    5. Frame(300,000, 350,000, 350,000, 400,000, 400,000) avg 360,000.