
ClickHouse: create database on cluster ends with a timeout


I have a cluster consisting of two ClickHouse nodes. Both instances run in Docker containers. All communication between the hosts has been checked successfully: ping, telnet, and wget all work fine. In ZooKeeper I can see my fired queries under the ddl branch.

Every execution of a "create database ... on cluster ..." statement ends with a timeout. What is the problem? Does anybody have any ideas?
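
The exact statement, taken from the server log below, is:

    create database event_history on cluster history_cluster;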

Here are fragments of the config file.

ClickHouse version: 20.10.3.30

<remote_servers>
    <history_cluster>
        <shard>
            <replica>
                <host>10.3.194.104</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>10.3.194.105</host>
                <port>9000</port>
            </replica>
        </shard>
    </history_cluster>
</remote_servers>
<zookeeper>
    <node index="1">
        <host>10.3.194.106</host>
        <port>2181</port>
    </node>
</zookeeper>
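
A quick sanity check (not part of the original question) is to confirm that both nodes have loaded this cluster definition and that each one recognizes itself as local:

    -- run on each node; the is_local column shows which replica the node
    -- believes is itself
    SELECT cluster, shard_num, replica_num, host_name, host_address, is_local
    FROM system.clusters
    WHERE cluster = 'history_cluster';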

The "macros" section

    <macros incl="macros" optional="true" />
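
With incl="macros", the actual values come from the substitutions file (by default /etc/metrika.xml, or whatever <include_from> points to). A sketch of what that included section might look like; the macro names and values below are illustrative assumptions, not taken from the question:

    <yandex>
        <macros>
            <shard>01</shard>
            <replica>10.3.194.104</replica>
        </macros>
    </yandex>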

The log fragment

2020.11.20 22:38:44.104001 [ 90 ] {68062325-a6cf-4ac3-a355-c2159c66ae8b} <Error> executeQuery: Code: 159, e.displayText() = DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000000013 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 2 unfinished hosts (0 of them are currently active), they are going to execute the query in background (version 20.10.3.30 (official build)) (from 172.17.0.1:51272) (in query: create database event_history on cluster history_cluster;), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&) @ 0xd8dcc75 in /usr/bin/clickhouse
1. DB::DDLQueryStatusInputStream::readImpl() @ 0xd8dc84d in /usr/bin/clickhouse
2. DB::IBlockInputStream::read() @ 0xd71b1a5 in /usr/bin/clickhouse
3. DB::AsynchronousBlockInputStream::calculate() @ 0xd71761d in /usr/bin/clickhouse
4. ? @ 0xd717db8 in /usr/bin/clickhouse
5. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x7b8c17d in /usr/bin/clickhouse
6. std::__1::__function::__func<ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'(), std::__1::allocator<ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'()>, void ()>::operator()() @ 0x7b8e67a in /usr/bin/clickhouse
7. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x7b8963d in /usr/bin/clickhouse
8. ? @ 0x7b8d153 in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so


Solution

  • The most probable cause is the nodes' Docker-internal IPs/hostnames.

    The initiator node (the one where the ON CLUSTER query is executed) puts a task into ZooKeeper addressed to 10.3.194.104 and 10.3.194.105. All nodes constantly poll the task queue and pick up the tasks addressed to them. If a node identifies itself as 127.0.0.1 / localhost, it never finds its tasks, because 10.3.194.104 != 127.0.0.1 (see the diagnostic sketch below).
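
    To see which hosts the initiator actually wrote into the task queue (a generic diagnostic, not part of the original answer; the task path is taken from the error message above):

        -- the task's value lists the hosts expected to execute it
        SELECT name, value
        FROM system.zookeeper
        WHERE path = '/clickhouse/task_queue/ddl'
          AND name = 'query-0000000013';

        -- which replicas have actually finished the task
        SELECT name
        FROM system.zookeeper
        WHERE path = '/clickhouse/task_queue/ddl/query-0000000013/finished';

    If the containers indeed identify themselves as 127.0.0.1 / localhost, one common remedy is to make each container address itself by the same IP/hostname used in remote_servers, for example by running the containers with host networking or with hostnames that resolve to 10.3.194.104 and 10.3.194.105.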