I need help with writing some equivalent of "group by" inside window function. I have a data in such format:
user_id | day | task_id | task_score | task_pass |
---|---|---|---|---|
1 | 01 | 1 | 0.95 | true |
1 | 01 | 2 | 0.7 | false |
1 | 03 | 2 | 0.95 | true |
1 | 04 | 2 | 0.9 | true |
1 | 05 | 5 | 0.85 | false |
Each day is guaranteed to have only one row per present task_id. And I need a function that for each day-user would:
End result should be like that:
user_id | day | average_max_score |
---|---|---|
1 | 01 | 0.825 |
1 | 02 | 0.825 |
1 | 03 | 0.95 |
1 | 04 | 0.95 |
1 | 05 | 0.916 |
Basically, I want to group by one column inside window function.
Honestly, I don't even know how to approach this. Native and "simple" solution would be: take all tasks, cross join them with user-day table, then use window function like this one:
max(task_score) over (
partition by user_id, task_id
order by day rows between
unbounded preceding
and current row) as max_score_to_this_day
and then use group by over day, user_id:
select avg(max_score_to_this_day)
from table
group by user_id, day
This approach probably will work, but it will create some unreasonable amounts of data since not every user did every task, it will create user-day-task rows for days user have yet to touch task(and not every user will try every task), and in general this looks like a bad approach.
Is there any better way to do what I've described?
sorry about table name, I am really tired to write CREATE_TABLE/INSERT, I spend more time on CREATE_TABLE/INSERT than on the real query. I don't understand WHY people who ask questions make people WHO answer questions to do really stupid work.
you don't have day=2 in the input data. It's really confusing.
create table slava_ukraini(user_id Int64, day Int64, task_id Int64, task_score Float64, task_pass String )
Engine = Memory as
select * from values (
(1, 01, 1, 0.95, 'true'),
(1, 02, 2, 0.7, 'false'),
(1, 03, 2, 0.95, 'true'),
(1, 04, 2, 0.9, 'true'),
(1, 05, 5, 0.85, 'false'));
select user_id, day, avg(max_task_score) avg_max_task_score from
(
select user_id, day, task_id, max(task_score) max_task_score from
(
select user_id, day, arrayJoin(ga) g, g.2 task_id, g.3 task_score
from (
select user_id, day,
groupArray( (day, task_id, task_score, task_pass) ) over (partition by user_id order by day rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ga
from slava_ukraini
)
) group by user_id, day, task_id
) group by user_id, day
order by user_id, day
┌─user_id─┬─day─┬─avg_max_task_score─┐
│ 1 │ 1 │ 0.95 │
│ 1 │ 2 │ 0.825 │
│ 1 │ 3 │ 0.95 │
│ 1 │ 4 │ 0.95 │
│ 1 │ 5 │ 0.9166666666666666 │
└─────────┴─────┴────────────────────┘