
The number of launched background merges rises periodically


I find it weird that the CPU load of the server our ClickHouse is deployed on has risen every 4 hours since Dec. 1st, even though there are no scheduled jobs. [Chart: CPU load during the last 2 days]

Then I found that the values of ProfileEvent_Merge and ProfileEvent_MergedRows in the system table metric_log might explain the CPU load spikes. The number of merges and the number of merged rows follow a similar trend, as shown below. [Chart: Merged rows per second during the last 90 days]
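For reference, a query along these lines (column names as they appear in system.metric_log on 21.11) reproduces that trend:

```sql
-- Aggregate the per-second merge counters from system.metric_log by hour.
SELECT
    toStartOfHour(event_time) AS hour,
    sum(ProfileEvent_Merge)      AS merges_started,
    sum(ProfileEvent_MergedRows) AS merged_rows
FROM system.metric_log
WHERE event_date >= today() - 90
GROUP BY hour
ORDER BY hour;
```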

I watched system.merges at 4 p.m. today (Dec. 8th). Eventually I found that merging two system tables, query_log and query_thread_log, takes about 200 s, and while those merges run, the CPU load clearly rises.
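Following the merges amounts to taking repeated snapshots of system.merges, roughly like this (columns as in 21.11):

```sql
-- Currently running merges; elapsed is in seconds, progress is in [0, 1].
SELECT
    database,
    table,
    round(elapsed, 1)  AS elapsed_s,
    round(progress, 2) AS progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges;
```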

I'm using ClickHouse 21.11.4.14. I have 8 user tables: 1 using the MergeTree engine, 3 using ReplacingMergeTree, and 4 using VersionedCollapsingMergeTree. The server has 2 cores, and the OS is CentOS 7.9 (64-bit).

If I've left out anything relevant, or there is more information you'd like to know, please reply.

There are three things about merges I can't understand:

1. Why does the number of merges suddenly increase on the first day of each month (Oct. 1st, Nov. 1st, Dec. 1st), even though there were no changes to the configuration or table structure?
2. Why does the number of merges increase every 4 hours? As far as I remember, merges don't run on a fixed schedule.
3. Why does the number of merges decrease gradually over the course of Oct., Nov., and Dec.?

Could you please explain these? And what can I do to solve the CPU load problem? Thanks very much for your reply.

This is the DDL of system.query_log. To save disk space, I've modified the TTL of some log tables in the system database (query_log, query_thread_log, metric_log, etc.) to TTL event_time + toIntervalMonth(3).

create table query_log
(
    type Enum8('QueryStart' = 1, 'QueryFinish' = 2, 'ExceptionBeforeStart' = 3, 'ExceptionWhileProcessing' = 4),
    event_date                            Date,
    event_time                            DateTime,
    event_time_microseconds               DateTime64(6),
    query_start_time                      DateTime,
    query_start_time_microseconds         DateTime64(6),
    query_duration_ms                     UInt64,
    read_rows                             UInt64,
    read_bytes                            UInt64,
    written_rows                          UInt64,
    written_bytes                         UInt64,
    result_rows                           UInt64,
    result_bytes                          UInt64,
    memory_usage                          UInt64,
    current_database                      String,
    query                                 String,
    formatted_query                       String,
    normalized_query_hash                 UInt64,
    query_kind LowCardinality(String),
    databases Array(LowCardinality(String)),
    tables Array(LowCardinality(String)),
    columns Array(LowCardinality(String)),
    projections Array(LowCardinality(String)),
    views Array(LowCardinality(String)),
    exception_code                        Int32,
    exception                             String,
    stack_trace                           String,
    is_initial_query                      UInt8,
    user                                  String,
    query_id                              String,
    address                               IPv6,
    port                                  UInt16,
    initial_user                          String,
    initial_query_id                      String,
    initial_address                       IPv6,
    initial_port                          UInt16,
    initial_query_start_time              DateTime,
    initial_query_start_time_microseconds DateTime64(6),
    interface                             UInt8,
    os_user                               String,
    client_hostname                       String,
    client_name                           String,
    client_revision                       UInt32,
    client_version_major                  UInt32,
    client_version_minor                  UInt32,
    client_version_patch                  UInt32,
    http_method                           UInt8,
    http_user_agent                       String,
    http_referer                          String,
    forwarded_for                         String,
    quota_key                             String,
    revision                              UInt32,
    log_comment                           String,
    thread_ids Array(UInt64),
    ProfileEvents Map(String, UInt64),
    Settings Map(String, String),
    used_aggregate_functions Array(String),
    used_aggregate_function_combinators Array(String),
    used_database_engines Array(String),
    used_data_type_families Array(String),
    used_dictionaries Array(String),
    used_formats Array(String),
    used_functions Array(String),
    used_storages Array(String),
    used_table_functions Array(String),
    `ProfileEvents.Names` Array(String)   alias mapKeys(ProfileEvents),
    `ProfileEvents.Values` Array(UInt64)  alias mapValues(ProfileEvents),
    `Settings.Names` Array(String)        alias mapKeys(Settings),
    `Settings.Values` Array(String)       alias mapValues(Settings)
)
    engine = MergeTree PARTITION BY toYYYYMM(event_date)
        ORDER BY (event_date, event_time)
        TTL event_time + toIntervalMonth(3)
        SETTINGS index_granularity = 8192;

Solution

  • ...
      PARTITION BY toYYYYMM(event_date)
    ...
     TTL event_time + toIntervalMonth(3)
    

The table is partitioned by month, but the TTL keeps data for 3 months. So every 4 hours ClickHouse performs a TTL merge: it rewrites the oldest parts to remove the expired rows (those older than 3 months).
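The 4-hour cadence corresponds to the MergeTree setting merge_with_ttl_timeout, whose default (14400 seconds) is exactly 4 hours; you can check the value on your server with:

```sql
-- Minimum delay between repeated TTL merges on the same table.
SELECT name, value
FROM system.merge_tree_settings
WHERE name = 'merge_with_ttl_timeout';
```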

But instead of a TTL re-merge, TTL is able to drop whole parts without reading or rewriting them. This drop operation is much lighter on CPU and disk I/O.

You can enable this with the MergeTree setting ttl_only_drop_parts. In that case data will be stored for up to 4 months, because a part is dropped only once all of its rows have expired, and with monthly partitions the newest rows in a partition expire a month after the oldest ones. Another way is to change the partitioning to daily and align the TTL rule with the partitioning:

      PARTITION BY toYYYYMMDD(event_date)
    ...
     TTL toStartOfDay(event_time) + toIntervalMonth(3)
     
    

    toStartOfDay -- all rows in a part will be expired at the same moment of time.

    toYYYYMMDD -- daily partitions
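As a sketch, applying the two options above to system.query_log could look like the following (ALTER syntax as in ClickHouse 21.11; note that the server manages system log tables and may recreate them on restart if their configured schema changes, so such ALTERs are not guaranteed to persist):

```sql
-- Option 1: keep monthly partitions, but let TTL drop fully expired
-- parts whole instead of rewriting them (data is kept up to 4 months).
ALTER TABLE system.query_log
    MODIFY SETTING ttl_only_drop_parts = 1;

-- Option 2: align row expiry with part boundaries. The TTL expression
-- itself can be changed in place:
ALTER TABLE system.query_log
    MODIFY TTL toStartOfDay(event_time) + toIntervalMonth(3);

-- The partition key, however, cannot be changed with ALTER; switching to
-- PARTITION BY toYYYYMMDD(event_date) requires recreating the table (for
-- system log tables, by overriding the <engine> clause of the <query_log>
-- section in config.xml and restarting the server).
```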

    https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-system-tables-eat-my-disk/

    https://kb.altinity.com/altinity-kb-queries-and-syntax/ttl/

    https://clickhouse.com/docs/en/operations/settings/settings/#ttl_only_drop_parts

    https://clickhouse.com/docs/en/operations/settings/merge-tree-settings/