Search code examples
sqldata-analysisclickhouse

How to add rows into table by condition in ClickHouse?


I have a table in ClickHouse with events of connecting and disconnecting to system. Query select timestamp, username, event from table gives following result.

timestamp username event
December 20, 2022, 18:24 1 Connect
December 20, 2022, 18:30 1 Disconnect
December 20, 2022, 18:34 1 Connect
December 21, 2022, 12:07 1 Disconnect
December 20, 2022, 12:15 2 Connect
December 20, 2022, 12:47 2 Disconnect

The session must be show in table as finished by the end of the day. If user was connected to system on 20th December and had no "Disconnect" after that in the same day, I have to add such 'Disconnect' event at the table with some query. And also I have to add row with event of 'Connect' to the table at 00:00 of next day. For example, in sample table you can see that user #1 had not finished session on 20th December, so I want to have the following result:

timestamp username event
December 20, 2022, 18:24 1 Connect
December 20, 2022, 18:30 1 Disconnect
December 20, 2022, 18:34 1 Connect
December 20, 2022, 23:59 1 Disconnect
December 21, 2022, 00:00 1 Connect
December 21, 2022, 12:07 1 Disconnect
December 20, 2022, 12:15 2 Connect
December 20, 2022, 12:47 2 Disconnect

Is there any way to amend the query so it will work as I described above? ClickHouse is not so common as Posrgres or SQL Server, as far as I know, so code in Postgres dialect will be fine, I will find out how to make the same with ClickHouse.


Solution

  • You do not need lateral join to achieve desired result in Clickhouse (JOINs in Clickhouse are always compute heavy operations since it is a columnar store. ClickHouse takes the right table and creates a hash table for it in RAM).

    You can use UNION ALL and ARRAY JOIN in specific way to generate missing rows:

    CREATE TABLE connections
    (
        `timestamp` DateTime,
        `username` LowCardinality(String),
        `event` enum('Connect', 'Disconnect')
    )
    ENGINE = Memory;
    
    INSERT INTO connections VALUES
    ('2022-12-20 18:24:00'::DateTime,   '1',    'Connect')
    ('2022-12-20 18:30:00'::DateTime,   '1',    'Disconnect')
    ('2022-12-20 18:34:00'::DateTime,   '1',    'Connect')
    ('2022-12-21 12:07:00'::DateTime,   '1',    'Disconnect')
    ('2022-12-20 12:15:00'::DateTime,   '2',    'Connect')
    ('2022-12-20 12:47:00'::DateTime,   '2',    'Disconnect');
    
    SELECT * FROM 
      (
        SELECT 
          timestamp, username, event 
        FROM 
          connections 
        UNION ALL 
        SELECT 
          timestamp, username, event 
        FROM 
          (
            SELECT 
              [toStartOfDay(timestamp) + INTERVAL '1 DAY' - INTERVAL '1 SECOND', 
              toStartOfDay(timestamp) + INTERVAL '1 DAY' ] timestamps, 
              username, 
              [ 'Disconnect', 'Connect' ] :: Array(Enum('Connect', 'Disconnect')) events 
            FROM 
              connections 
            GROUP BY 
              toStartOfDay(timestamp), username 
            HAVING 
              anyLast(event) = 'Connect'
          ) ARRAY JOIN 
            timestamps AS timestamp, 
            events AS event
      ) 
    ORDER BY 
      username, timestamp
    

    Here is the result:

    ┌───────────timestamp─┬─username─┬─event──────┐
    │ 2022-12-20 18:24:00 │ 1        │ Connect    │
    │ 2022-12-20 18:30:00 │ 1        │ Disconnect │
    │ 2022-12-20 18:34:00 │ 1        │ Connect    │
    │ 2022-12-20 23:59:59 │ 1        │ Disconnect │
    │ 2022-12-21 00:00:00 │ 1        │ Connect    │
    │ 2022-12-21 12:07:00 │ 1        │ Disconnect │
    │ 2022-12-20 12:15:00 │ 2        │ Connect    │
    │ 2022-12-20 12:47:00 │ 2        │ Disconnect │
    └─────────────────────┴──────────┴────────────┘
    
    8 rows in set. Elapsed: 0.011 sec.