I have 3 nodes with 3 shards and 2 replicas on each:
Added also the XML config for the sharding and replicas
<default_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-0</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-2</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-1</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-0</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-2</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-1</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
</default_cluster>
I am doing the following example:
create database test on cluster default_cluster;
CREATE TABLE test.test_distributed_order_local on cluster default_cluster
(
id integer,
test_column String
)
ENGINE = ReplicatedMergeTree('/default_cluster/test/tables/test_distributed_order_local/{shard}', '{replica}')
PRIMARY KEY id
ORDER BY id;
CREATE TABLE test.test_distributed_order on cluster default_cluster as test.test_distributed_order_local
ENGINE = Distributed(default_cluster, test, test_distributed_order_local, id);
insert into test.test_distributed_order values (1, 'test1');
insert into test.test_distributed_order values (2, 'test2');
insert into test.test_distributed_order values (3, 'test3');
The results are not the same, and they contain duplications: Eg
Result 1:
Result 2:
What am I missing?
I expect to not have duplicated rows in the select
I think this post probably sums up what you're trying to achieve - https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse It's a little old but the principle applies - For Clickhouse not a topology that's recommended.
Consider this simplified example:
<shard>
// These two are replicas of each other
<replica>
<host>cluster_node_0</host>
</replica>
**<replica>
<host>cluster_node_2</host>
</replica>**
</shard>
<shard>
<replica>
<host>cluster_node_1</host>
</replica>
<replica>
<host>cluster_node_0</host>
</replica>
</shard>
<shard>
**<replica>
<host>cluster_node_2</host>
</replica>**
<replica>
<host>cluster_node_1</host>
</replica>
</shard>
Let's suppose data is written into the first shard on node cluster_node_0
. It will then be replicated to the shard on cluster_node_2
- as the zookeeper path is the same.
Now for the issue. You have also defined the 3rd shard on cluster_node_2
. When you create this table, it will physically contain data from 2 shards - the 1st and 3rd - I've attempted to highlight with **
.
When a query comes in, it will be sent to each shard. The challenge is each local table will respond with results from both shards - hence you get duplicates.
Generally, avoid more than one shard on a host - the blog explains how you can achieve more than one buts its not recommended or ever need.