Search code examples
clickhouse

How to check not seen recently data via materialized view in clickhouse


I got this tables

  • seen_ids table with ids.
  • source_ids null engine table where data comes
  • target_ids with fresh ids that was not in seen_ids on the moment when it comes.

I need to store only ids that are not in seen_ids:

    CREATE materialized view mv_fresh
    TO target_ids
    AS SELECT
      *
    FROM source_ids where id not in (select id from seen_ids)

Also Id adds to seen_ids, so then it comes next time it is not added to target_ids.

Other option is to use LEFT JOIN, may be it is also not good idea for large table.

This is example of what I need to do, what ways do you recommend to check on "not seen recently" ids? This example is probably bad design for large seen_ids table (50 mln rows) and large income data.


Solution

  • Some ways to improve perf:

    1. try to optimize IN operator
    CREATE materialized view mv_fresh
        TO target_ids
        AS SELECT
         ..
        FROM source_ids
        where id not in (
            select id from seen_ids 
            where id IN (select id FROM source_ids))
    
    1. consider to reduce index_granularity for db.seen_ids or use skipping indexes

    See Skipping Indexes: set.

    CREATE TABLE db.seen_ids_local ON CLUSTER cl
    (..)
    ORDER BY id
    ENGINE = MergeTree()
    SETTINGS index_granularity = 256 or 512 or ..;
    
    1. try to find some data insights that help to reduce the amount of processed data

    For example, if id is incremental then can be applied some condition WHERE id > {some_value_that_calculated_by_MV}

    1. use SET engine

    Make sense to consider using a table with SET engine to store seen IDs and use it in IN operator.

    1. bulk insert (thousands of rows or more)

    It needs to

    • either change the logic of the service that is writing data to db.source_ids
    • or use Buffer engine (see related questions on SO)
    1. use distributed calculation (sharding) (split IDs to groups that be proceeded independently)

    For example, for 2 shards on each one be proceeded 50 / 2 = 25 mln rows of seen_ids table.

    CREATE TABLE db.seen_ids_local ON CLUSTER cl
    (..)
    ORDER BY id
    ENGINE = MergeTree();
    
    /* ! Write data to distributed table to reshard them among shards. Or write data directly to shard based on sharding key. */
    CREATE TABLE db.seen_ids ON CLUSTER cl AS db.seen_ids_local
    ENGINE = Distributed(cl, db, seen_ids_local, intHash32(id)); /* Reshard data by 'id'. */
    
    
    CREATE TABLE db.source_ids_local ON CLUSTER cl
    (..)
    ENGINE = Null;
    
    /* ! Write data to distributed table to reshard them among shards. */
    CREATE TABLE db.source_ids ON CLUSTER cl AS db.source_ids_local
    ENGINE = Distributed(cl, db, source_ids_local, intHash32(id)); /* Reshard data by 'id'. The same sharding key as for 'seen_ids' */
    
    
    CREATE TABLE db.target_ids_local ON CLUSTER cl
    (..)
    ENGINE = MergeTree();
    
    CREATE materialized view db.target_ids_local_mv
    TO db.target_ids_local  /* [Optional] Write data to local table */
    AS SELECT
    ..  
    FROM db.source_ids_local /* !! Load data from local table */
    where id not in (
      select id from db.seen_ids_local /* !! Load data from local table */
    )
    

    Look at these docs: Altinity Knowledge Base.