Search code examples
rtidyverseportforwardingdbplyr

Translate Python MySQL ssh port forwarding solution to R (dbplyr)


I'm trying to query a MySQL server through an R/Tidyverse/dbplyr workflow. My MySQL access requires configuring SSH and port forwarding.

I have this code working using python (below), but I'm struggling to get started with the SSH/port forwarding equivalent in R. Any pointers to solutions or equivalent R packages appreciated. thanks.

import pymysql
import paramiko
import pandas as pd
from paramiko import SSHClient
from sshtunnel import SSHTunnelForwarder
from os.path import expanduser


pkeyfilepath = '/.ssh/id_ed25519'
home = expanduser('~')
mypkey = paramiko.Ed25519Key.from_private_key_file(home + pkeyfilepath)

sql_hostname = 'mycompany.com'
sql_username = 'me'
sql_password = '***'
sql_main_database = 'my_db'
sql_port = 3306
ssh_host = 'jumphost.mycompany.com'
ssh_user = 'me'
ssh_port = 22

with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username=ssh_user,
        ssh_pkey=mypkey,
        remote_bind_address=(sql_hostname, sql_port)) as tunnel:
    conn = pymysql.connect(host='127.0.0.1', user=sql_username,
                           passwd=sql_password, db=sql_main_database,
                           port=tunnel.local_bind_port)
    query = '''SELECT VERSION();'''
    data = pd.read_sql_query(query, conn)
    print(data)
    conn.close()

Solution

  • There are several ways to do ssh port forwarding for R. In no particular order:

    1. I forward it externally to R. All of my work is remote, and for one particular client I need access to various instances of SQL Server, Redis, MongoDB, remote filesystems, and a tunnel-hop to another network only accessible from the ssh bastion host. I tend to do work in more than R, so it's important to me that I generalize this. It is not for everybody or every task.

      For this, I used a mismash of autossh and my ssh-agent (in KeePass/KeeAgent).

    2. The ssh package does have a function to Create a Tunnel. The premise is that you have already created a "session" to which you can add a forwarding rule(s). When using ssh::ssh_tunnel, it is blocking, meaning you cannot use it in the same R process and continue to work. Demo:

      # R session 1
      sess <- ssh::ssh_connect("user@remote")
      # insert passphrase
      ssh::ssh_tunnel(sess, 21433, "otherremote:1433")
      # / Waiting for connection on port 21433...
      
      # R session 2
      con <- DBI::dbConnect(..., port=21433)
      DBI::dbGetQuery(con, "select 1 as n")
      #   n
      # 1 1
      

      This connection will stay alive so long as con is not closed and the remote end does not close it (e.g., activity timeout).

      Note: I cannot get the ssh package to use my ssh-agent, so all passwords must be typed in or otherwise passed in not-ideal ways. There are many ways to not have to type it, such as using the keyring package (secure) or envvars, both of which would pass the password to ssh_connect(..., passwd=<>).

    3. The above, but using callr so that you don't need to explicit sessions active (though you will still have another R session.

      bgr <- callr::r_bg(function() {
        ssh <- ssh::ssh_connect("r2@remote", passwd=keyring::key_get("r2", "remote"))
        ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433")
      }, supervise = TRUE)
      DBI::dbGetQuery(con, "select 1 as n")
      #   n
      # 1 1
      ### when your work is done
      bgr$kill()
      

      If you do this, I strongly recommend the use of supervise=TRUE, which ensures the background R process is killed when this (primary) R session exits. This will reduce the risk of having phantom unused R sessions hanging around; in addition to just clogging up the process tree, if one of these phantom R processes is actively forwarding a port, that means nothing else can forward that port. This allows you to continue working, but you are not longer in control of the process doing the forwarding ... and subsequent attempts to tunnel will fail.

      FYI, I generally prefer using keyring::key_get("r2", "remote") for password management in situations like this: (1) it prevents me from having to set that envvar each time I start R ... which will inadvertently store the plain-string password in ~/.Rhistory, if saved; (2) it prevents me from having to set that envvar in the global environment permanently, which is prone to other stupid mistakes I make; and (3) is much better protected since it is using the native credentials of your base OS. Having said that, you can replace the above use of keyring::key_get(..) with Sys.getenv("mypass") in a pinch, or in a case where the code is running on a headless system where a credentials manager is unavailable.

      And if you want this to be a little more resilient to timeout disconnects, you can instead use

      bgr <- callr::r_bg(function() {
        ssh <- ssh::ssh_connect("r2@remote", passwd=keyring::key_get("r2", "remote"))
        while (!inherits(try(ssh::ssh_tunnel(ssh, port=21433, "otherremote:1433"), silent=TRUE), "try-error")) Sys.sleep(1)
      }, supervise = TRUE)
      

      which will repeatedly make the tunnel so long as the attempt does not error. You may need to experiment with this to get it "perfect".

    4. callr is really just using processx under the hood to start a background R process and allow you to continue working. If you don't want the "weight" of another R process solely to forward ports, you can use processx to start an explicit call to ssh that does everything you need it to do.

      proc <- processx::process$new("ssh", c("-L", "21433:otherremote:1433", "r2@remote", "-N"))
      ### prompts for password
      DBI::dbGetQuery(con, "select 1 as n")
      #   n
      # 1 1
      ### when done
      proc$kill()
      # [1] TRUE