Search code examples
sqlclickhouse

Clickhouse creating columns base on value of other rows


I'm trying to know how much record a user has generated during the last x hours from when the record is created. In classic SQL what I've tried so far should be working.

I'm posting here because I didn't find anything specific to Clickhouse about this specific problem.

Have a nice day ;)

Simplified example - source table:

ID | time          | name  
---+---------------+---------
1   05/05 14:20       bob         
2   05/05 14:30       josh       
3   05/05 18:30       bob         
4   06/05 15:30       bob         
5   08/05 18:30       josh           

Select result - source table :

ID | time          | name  | nb_last_24hours
---+---------------+-------+----------------
1   05/05 14:20       bob         0
2   05/05 14:30       josh        0
3   05/05 18:30       bob         1
4   06/05 15:30       bob         1
5   08/05 18:30       josh        0

I've tried this query:

SELECT  
    b.ID, b.`time`, b.name,
    (SELECT COUNT(*) 
     FROM T1 AS a 
     WHERE a.name = b.name 
       AND a.`time` < b.`time` 
       AND a.`time` >= (b.`time` - (60 * 60 * 24))
    )
FROM  
    T1 AS b;

but I get the following error:

SQL Error [47] [07000]: Code: 47. DB::Exception: Missing columns: 'b.time' 'b.a_msisdn' while processing query: 'SELECT count() FROM VFG.voice_traffic AS a WHERE (a_msisdn = b.a_msisdn) AND (time < b.time) AND (time >= (b.time - ((60 * 60) * 24)))', required columns: 'a_msisdn' 'b.a_msisdn' 'time' 'b.time', maybe you meant: ['a_msisdn','a_msisdn','time','time']: While processing (SELECT count(*) FROM VFG.voice_traffic AS a WHERE (a.a_msisdn = b.a_msisdn) AND (a.time < b.time) AND (a.time >= (b.time - ((60 * 60) * 24)))) AS _subquery4916. (UNKNOWN_IDENTIFIER) (version 22.3.3.44 (official build)) , server ClickHouseNode [uri=http://192.168.13.79:8123/default, options={use_server_time_zone=false,use_time_zone=false}]@-1433098529

I'm assuming that b doesn't exist in the context of the inner query.


Solution

  • Try COUNT with RANGE.

    SELECT 
      c1, c2
    , COUNT (1) OVER 
      (
        PARTITION BY c1
        ORDER BY c2 
        RANGE BETWEEN 
        86400 PRECEDING 
        AND 
        1 PRECEDING
      )
      as cnt
    FROM VALUES 
    (
      ('bob', now())
    , ('bob', now() + INTERVAL 10 HOUR)
    , ('bob', now() + INTERVAL 25 HOUR)
    
    , ('josh', now())
    , ('josh', now() + INTERVAL 10 HOUR)
    , ('josh', now() + INTERVAL 15 HOUR)
    )
    FORMAT PrettyCompact
    

    The result is:

    +-c1---+------------------c2-+-cnt-+
    | bob  | 2023-09-15 20:22:34 |   0 |
    | bob  | 2023-09-16 06:22:34 |   1 |
    | bob  | 2023-09-16 21:22:34 |   1 |
    | josh | 2023-09-15 20:22:34 |   0 |
    | josh | 2023-09-16 06:22:34 |   1 |
    | josh | 2023-09-16 11:22:34 |   2 |
    +------+---------------------+-----+
    

    fiddle