I have the following data:
CREATE SCHEMA test_schema;
CREATE TABLE test_schema.joinedDf(id LONG, firstName TEXT, lastName TEXT, age INT, month INT, salary DECIMAL);
INSERT INTO test_schema.joinedDf(id, firstName, lastName, age, month, salary) VALUES
(1111, 'Peter', 'Ivanov', 29, 10, 300000),
(1111, 'Peter', 'Ivanov', 29, 12, 350000),
(1111, 'Peter', 'Ivanov', 29, 11, 350000);
When I do this:
SELECT id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) average_salary
FROM test_schema.joinedDf;
I get this data:
1111, Peter, Ivanov, 29, 10, 300000, 333333.3333
1111, Peter, Ivanov, 29, 12, 350000, 333333.3333
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333
But when I do:
SELECT id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id ORDER BY salary) average_salary
FROM test_schema.joinedDf;
I get:
1111, Peter, Ivanov, 29, 10, 300000, 300000.0000
1111, Peter, Ivanov, 29, 12, 350000, 333333.3333
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333
I read that by default, if you specify order by
inside the partition by
clause. You get the window frame of unbounded preceding and current row
. But why doesn't the data look like this?:
1111, Peter, Ivanov, 29, 10, 300000, 300000.0000
1111, Peter, Ivanov, 29, 12, 350000, 325000.0000
1111, Peter, Ivanov, 29, 11, 350000, 333333.3333
So, we first have Partition(300000)
, the average is 300000
, then we have Partition(300000, 350000)
, the average is 325000
, then we have Partition(300000, 350000, 350000)
the average is 333333.3333
. No?
The problem is that the default window frame (if you specify order by
) is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(more about it here, it's for spark, but mysql works similarly: What's the default window frame for window functions), while, if we write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
it works fine.
So, it's a ROWS
vs. RANGE
issue.
From https://www.mysqltutorial.org/mysql-window-functions/ :
The frame unit specifies the type of relationship between the current row and frame rows. It can be ROWS or RANGE. The offsets of the current row and frame rows are the row numbers if the frame unit is ROWS and row values the frame unit is RANGE.
So, since we in our RANGE
we have duplicates, it processes them as "one unit", instead of separately.
For example, let's look at this data:
if we run the same query:
select id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id order by salary) average_salary
from test_schema.joinedDf;
we get:
Because the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
and it will first process this data:
While ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
would process this: