I have a Clickhouse table like this. We ingest rows data where the unique field is idempotency_id.
This is a sample. We can't use other columns for uniquness (like max(date) o max(clicks).
I cannot use common deduplication strategies like ReplacingMergeTree(idempotency_id) or CollapsingMergeTree(idempotency_id) because idempotency_id is a string.
At the moment I remove duplicate at query-time using PARTITION BY ROW_NUMER but it consume to memory and query stopped.
SOLVED.
CREATE TABLE my_table
(
`idempotency_id` UUID,
...other fields
version UInt32 DEFAULT now()
)
ENGINE = ReplacingMergeTree
PRIMARY KEY (idempotency_id)
ORDER BY (idempotency_id, ....)
Then when I want to be sure to remove duplicates I run
OPTIMIZE TABLE my_table