
Cannot connect to ClickHouse via Apache Airflow (all inside Docker)


I have clickhouse-server and Airflow running as Docker containers in WSL. Both are up, and I even managed to connect to ClickHouse somehow, but when I try to query I get the following traceback in the Airflow logs:

[2024-06-04, 12:17:49 UTC] {logging_mixin.py:188} INFO - >>> Running 'etl' function. Logged at 2024-06-04 12:17:49.678034+00:00
.....
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - >>> Running 'clickhouse_connect' function. Logged at 2024-06-04 12:17:53.590231+00:00
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connecting to ClickHouse Cluster
[2024-06-04, 12:17:53 UTC] {base.py:84} INFO - Using connection ID 'my_clickhouse' for task execution.
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - 8123
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connected successfully
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Querying CHDB
[2024-06-04, 12:17:53 UTC] {connection.py:408} WARNING - Failed to connect to localhost:9000
Traceback (most recent call last):
.....
clickhouse_driver.errors.NetworkError: Code: 210. Connection refused (localhost:9000)

It seems that Airflow is trying to connect to localhost:9000, which I thought was meant only for CLI usage, and not to localhost:8123, which is the HTTP web interface.

What should I do to force a connection on port 8123, or am I getting this wrong? Either way, the main problem is that I can't work with ClickHouse via Airflow.

PS: I use clickhouse_driver.Client for the connection inside a PythonOperator.

And this code to connect:

class ClickHouseConnection:
    connection = None

    @staticmethod
    def get_connection(connection_name='my_clickhouse'):
        from clickhouse_driver import Client
        from airflow.hooks.base import BaseHook

        db_props = BaseHook.get_connection(connection_name)
        if ClickHouseConnection.connection is None:
            ClickHouseConnection.connection = Client(db_props.host)
        return ClickHouseConnection.connection, db_props

@logger
def clickhouse_connect():
    print("Connecting to ClickHouse Cluster")
    ch_connection, db_props = ClickHouseConnection.get_connection()
    print(db_props.port)
    print("Connected successfully")
    filename = '/opt/airflow/data/test_csv_file.csv'
    if ch_connection:
        print("Querying CHDB")
        ch_connection.execute(
            f"INSERT INTO maindb.monitor FROM INFILE '{filename}' FORMAT CSV"
        )

Thank you in advance!

I tried changing the port to 9000 in Airflow's Connections and restarting the containers, but nothing changed.


Solution

  • The issue was the host name in the ClickHouse connection. I don't quite remember where I read it, but since I run a multi-container Docker application I should have used host.docker.internal as the host name, and with that, port 9000 worked for me.
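A minimal sketch of the parameters that ended up working in this setup (host and port taken from the solution above; clickhouse_driver speaks the native protocol, not HTTP, so 8123 would not have worked with it anyway):

```python
def clickhouse_client_kwargs(host='host.docker.internal', port=9000):
    """Connection parameters for this setup: host.docker.internal
    resolves to the Docker host from inside a container (Docker
    Desktop / WSL), and 9000 is ClickHouse's native-protocol port."""
    return {'host': host, 'port': port}

# Usage:
#   from clickhouse_driver import Client
#   client = Client(**clickhouse_client_kwargs())
```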