Search code examples
clickhouse

Clickhouse Cluster - Sharding data on distributed table with different number of shards & nodes


I have a 3 node ClickHouse cluster

<secret>clickhousecluster</secret>
     <shard>
         <replica>
             <host>CHN1</host>
             <port>9000</port>
         </replica>                
     </shard>
     <shard>
         <replica>
             <host>CHN2</host>
             <port>9000</port>
         </replica>                
     </shard>
     <shard>
         <replica>
             <host>CHN3</host>
             <port>9000</port>
         </replica>                
     </shard>

I created a distributed table:

columns:
`timestamp` DateTime CODEC(Delta(4), ZSTD(1)),
 `host` LowCardinality(String),`message` String ,
 `severity` LowCardinality(String),`message` String

create table db1.hosts_distr on
cluster cluster1 as db1.hosts engine = Distributed(cluster1, db1, hosts, if(severity =  'INFO', 1, 3));
    

This if(severity = 'INFO', 1, 3)) doesn't work either. It just puts all records on 1 node. What I need is to send all records with severity INFO to 2 nodes and all the rest to 1 node. Is it possible?

Thanks in advance.


Solution

  • when you insert data into distirbuted table than data will slit in temporary .bin files with Native format, by sharding_expression % number_of_shards in cluster

    so proper sharding expression should be in your case

    Distributed(cluster1, db1, hosts, if(severity = 'INFO', 1, rand() % 2));

    and you will need to change % 2 to % N when you add more shards

    if your db1.hosts have engine=ReplicatedMergeTree then you also need add <internal_replication>true</internal_replication> into each <shard>