Search code examples
clickhouse

ClickHouse deduplication using idempotency_id column


I have a Clickhouse table like this. We ingest rows data where the unique field is idempotency_id.

This is a sample. We can't use other columns for uniquness (like max(date) o max(clicks).

2023-12-13 20:00:00  97  5c77ed31-90ba-4b2b-b8a7-b5d22b8eb662
2023-12-18 21:00:00 92  83b13098-2afb-434c-89ff-1b5e0c58bdf0
2023-12-12 20:00:00 76  217f0432-081d-4b40-9a22-8cd6460943ab
2023-12-13 15:00:00 74  d7eaccd4-e489-42ea-ae22-2bbadc1785eb
2023-12-18 20:00:00 72  4fb98924-c5d1-4b07-9d1e-a371f866dc43

I cannot use common deduplication strategies like ReplacingMergeTree(idempotency_id) or CollapsingMergeTree(idempotency_id) because idempotency_id is a string.

At the moment I remove duplicate at query-time using PARTITION BY ROW_NUMER but it consume to memory and query stopped.


Solution

  • SOLVED.

    CREATE TABLE my_table
    (
        `idempotency_id` UUID,
        
         ...other fields    
    
        version UInt32 DEFAULT now() 
    )
    ENGINE = ReplacingMergeTree
    PRIMARY KEY (idempotency_id)
    ORDER BY (idempotency_id, ....)
    

    Then when I want to be sure to remove duplicates I run

    OPTIMIZE TABLE my_table