Search code examples
ttlclickhouse

ClickHouse TTL on materialized column


I am trying to upgrade the clickhouse cluster from version 18.8 to 19.9.2. Previously, I had a cronjob that deletes old data from the database. I want to start using TTL feature instead.

Simplified table definition:

    CREATE TABLE myTimeseries(
                   timestamp_ns Int64,
                   source_id String,
                   data String,
                   date Date MATERIALIZED toDate(timestamp_ns/1e9),
                   time DateTime MATERIALIZED toDateTime(timestamp_ns/1e9)) 
    ENGINE = MergeTree()
    PARTITION BY (source_id, toStartOfHour(time))
    TTL date + toInterValDay(7)
    SETTINGS index_granularity=8192, merge_with_ttl_timeout=43200

The problem is, it does not delete old data. I could not find anything in the documentation that would help debug this issue.

Questions:

  1. How can I debug this issue? (Is there a way to see when the data will be cleared in the future)?

  2. Might this be because of date field being materialized? I have another table where date is not a materialized field and everything works fine.


Solution

  • Yes, you can use materialized fields with TTL feature. I've attached simple query that create table with 5 minutes interval to delete. It works fine with clickhouse server version 20.4.5

    CREATE TABLE IF NOT EXISTS test.profiling
    (
        headtime UInt64, 
        date DateTime MATERIALIZED toDateTime(headtime),  
        id Int64,  
        operation_name String,
        duration Int64
    )   
    ENGINE MergeTree()
    PARTITION BY toYYYYMM(date)
    ORDER BY (date, id)
    TTL date + INTERVAL 5 MINUTE
    

    And important note from clickhouse documentation:

    Data with an expired TTL is removed when ClickHouse merges data parts.

    When ClickHouse see that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set merge_with_ttl_timeout. If the value is too low, it will perform many off-schedule merges that may consume a lot of resources.

    If you perform the SELECT query between merges, you may get expired data. To avoid it, use the OPTIMIZE query before SELECT.