Tags: python, ssh, dask, dask-dataframe, dask-kubernetes

How can I establish an SSH tunnel to a database for Dask?


I want to connect to a Postgres database from Dask, where the database is reachable only through an SSH tunnel. While there are ways to create SSH tunnels in Python, I haven't found a straightforward way to integrate one with Dask's connection mechanisms.

One potential solution is to forward a local port to the database and then connect to localhost:port, where the tunnel listens. However, I'm uncertain how this approach behaves within a Dask cluster: since the SSH tunnel exists only on the node where the tunnel code runs, it might not be accessible from the Dask workers.
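
A minimal sketch of that local-forwarding approach, assuming the third-party sshtunnel package; the bastion host, database host, user, and key below are placeholders:

from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    ("bastion.example.com", 22),                # SSH host (placeholder)
    ssh_username="deploy",                      # placeholder
    ssh_pkey="~/.ssh/id_rsa",                   # placeholder private key
    remote_bind_address=("db.internal", 5432),  # Postgres behind the bastion
)
tunnel.start()

# The forwarded port exists only on the machine running this code;
# Dask workers on other nodes cannot reach localhost here.
connection_string = (
    f"postgresql://user:password@localhost:{tunnel.local_bind_port}/mydb"
)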

I'm currently using the

dd.read_sql_query(query, connection_string)

method for reading SQL queries in Dask. I'm considering whether I need to create the SSH tunnel on each worker node using

client.run(create_ssh_tunnel) 

However, I'm unsure how this will interact with autoscaling. Specifically, when Dask adds workers during periods of high load, will the new workers create the tunnel before they start running tasks?
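
For concreteness, a sketch of what the read itself looks like against the tunneled connection string, with a made-up orders table for illustration (dd.read_sql_query expects a SQLAlchemy selectable rather than a raw SQL string, plus an index_col that Dask uses to partition the reads):

import dask.dataframe as dd
import sqlalchemy as sa

meta = sa.MetaData()
orders = sa.Table(
    "orders", meta,
    sa.Column("id", sa.Integer),
    sa.Column("total", sa.Float),
)
query = sa.select(orders.c.id, orders.c.total)
df = dd.read_sql_query(query, connection_string, index_col="id")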


Solution

  • when Dask adds workers during periods of high load, will the new workers create the tunnel before they start running tasks

    No, client.run executes once, at the moment you call it, and only on the workers connected at that time. If you want every worker to do something at launch, you want a Nanny plugin (if you are using nannies) or a Worker plugin; see https://distributed.dask.org/en/stable/plugins.html. You would define a setup() method that calls your SSH tunnel creation function. Make sure you handle the possible error cases: clean up on teardown, and cope with the tunnel already being open (see the fuller sketch after the minimal example below).

    Minimal example:

    from distributed.diagnostics.plugin import WorkerPlugin

    class WorkerSSH(WorkerPlugin):
        def setup(self, worker):
            # Runs on every worker as it starts, including workers
            # added later by autoscaling.
            create_ssh_tunnel()

    plugin = WorkerSSH()
    client.register_plugin(plugin)
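
    As noted above, the plugin should also clean up on teardown and tolerate a tunnel that is already open. A fuller sketch, where create_ssh_tunnel() and close_ssh_tunnel() stand in for your own (hypothetical) helper functions:

    from distributed.diagnostics.plugin import WorkerPlugin

    class WorkerSSH(WorkerPlugin):
        def setup(self, worker):
            # Cope with the tunnel already being open, e.g. when a
            # restarted worker runs setup() again.
            if getattr(worker, "_ssh_tunnel", None) is None:
                worker._ssh_tunnel = create_ssh_tunnel()  # hypothetical helper

        def teardown(self, worker):
            # Clean up the tunnel when the worker shuts down.
            tunnel = getattr(worker, "_ssh_tunnel", None)
            if tunnel is not None:
                close_ssh_tunnel(tunnel)  # hypothetical helper
                worker._ssh_tunnel = None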