Search code examples
sqlwindow-functionsclickhouse

SQL (clickhouse) grouping INSIDE window function


I need help with writing some equivalent of "group by" inside window function. I have a data in such format:

user_id day task_id task_score task_pass
1 01 1 0.95 true
1 01 2 0.7 false
1 03 2 0.95 true
1 04 2 0.9 true
1 05 5 0.85 false

Each day is guaranteed to have only one row per present task_id. And I need a function that for each day-user would:

  1. Grab data before that day including that day
  2. Group data over task_id and calculate some aggregation function, for example, max(task_score)
  3. Take these grouped values, calculate another aggregation function over them and write that in column. For example, average of max_score for each task.

End result should be like that:

user_id day average_max_score
1 01 0.825
1 02 0.825
1 03 0.95
1 04 0.95
1 05 0.916

Basically, I want to group by one column inside window function.

Honestly, I don't even know how to approach this. Native and "simple" solution would be: take all tasks, cross join them with user-day table, then use window function like this one:

max(task_score) over (
  partition by user_id, task_id 
  order by day rows between 
    unbounded preceding
    and current row) as max_score_to_this_day 

and then use group by over day, user_id:

select avg(max_score_to_this_day)
from table
group by user_id, day 

This approach probably will work, but it will create some unreasonable amounts of data since not every user did every task, it will create user-day-task rows for days user have yet to touch task(and not every user will try every task), and in general this looks like a bad approach.

Is there any better way to do what I've described?


Solution

    1. sorry about table name, I am really tired to write CREATE_TABLE/INSERT, I spend more time on CREATE_TABLE/INSERT than on the real query. I don't understand WHY people who ask questions make people WHO answer questions to do really stupid work.

    2. you don't have day=2 in the input data. It's really confusing.

    create table slava_ukraini(user_id Int64, day Int64, task_id Int64, task_score Float64, task_pass String )
    Engine = Memory as 
    select * from values (
    (1,     01,     1,  0.95,   'true'),
    (1,     02,     2,  0.7,    'false'),
    (1,     03,     2,  0.95,   'true'),
    (1,     04,     2,  0.9,    'true'),
    (1,     05,     5,  0.85,   'false'));
    
    
    
    select user_id, day, avg(max_task_score) avg_max_task_score from 
    (
        select user_id, day, task_id, max(task_score) max_task_score from 
        (
          select user_id, day, arrayJoin(ga) g, g.2 task_id, g.3 task_score
          from (
              select user_id, day, 
                     groupArray( (day, task_id, task_score, task_pass) ) over (partition by user_id order by day rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  ga
              from slava_ukraini
          )
        ) group by user_id, day, task_id
    ) group by user_id, day
    order by user_id, day
    
    ┌─user_id─┬─day─┬─avg_max_task_score─┐
    │       1 │   1 │               0.95 │
    │       1 │   2 │              0.825 │
    │       1 │   3 │               0.95 │
    │       1 │   4 │               0.95 │
    │       1 │   5 │ 0.9166666666666666 │
    └─────────┴─────┴────────────────────┘