Tags: python, ssh, airflow

Airflow SSHExecuteOperator() with env=... not setting remote environment


I am modifying the environment of the calling process, appending to its PATH and setting some new environment variables. However, when I print os.environ in the remote process, these changes are not reflected. Any idea what may be happening?

My call to the script on the instance:

import os

# Airflow 1.x contrib imports
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

ssh_hook = SSHHook(conn_id=ssh_conn_id)
temp_env = os.environ.copy()
temp_env["PATH"] = "/somepath:" + temp_env["PATH"]
run = SSHExecuteOperator(
        bash_command="python main.py",
        env=temp_env,
        ssh_hook=ssh_hook,
        task_id="run",
        dag=dag)

Solution

    Explanation: Implementation Analysis

    If you look at the source of Airflow's SSHHook class, you'll see that it doesn't incorporate the env argument into the remotely run command at all. The SSHExecuteOperator implementation passes env= through to the Popen() call on the hook, but that only hands it to the local subprocess.Popen(): it sets the environment of the locally spawned ssh client process, not of the remote command.

    Thus, in short: Airflow does not support passing environment variables over SSH. To support this, it would need either to incorporate them into the remotely executed command, or to add the SendEnv option to the locally executed ssh command for each variable to be sent (and even then, this would work only if the remote sshd were configured with AcceptEnv whitelisting the specific environment variable names to be received).
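
    For illustration, here's a minimal sketch of that SendEnv approach outside Airflow entirely (MY_VAR and user@remote-host are hypothetical placeholders, and this works only if the remote sshd_config whitelists the name with an AcceptEnv MY_VAR line):

    import os
    import subprocess

    os.environ["MY_VAR"] = "hello"   # set in the local environment
    subprocess.call([
        "ssh",
        "-o", "SendEnv=MY_VAR",      # ask the local ssh client to forward it
        "user@remote-host",          # hypothetical remote host
        'echo "$MY_VAR"',            # prints remotely only if AcceptEnv allows it
    ])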


    Workaround: Passing Environment Variables On The Command Line

    from pipes import quote # in Python 3, make this "from shlex import quote"
    
    def with_prefix_from_env(env_dict, command=None):
        # Build "set -a; k=v; k=v; ...; command"; set -a makes the shell
        # export each subsequent assignment into the command's environment.
        result = 'set -a; '
        for (k, v) in env_dict.items():
            result += '%s=%s; ' % (quote(k), quote(v))
        if command:
            result += command
        return result
    
    SSHExecuteOperator(bash_command=with_prefix_from_env(temp_env, "python main.py"),
                       ssh_hook=ssh_hook, task_id="run", dag=dag)
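
    For example, with a single-entry dictionary the generated command string looks like this (pipes.quote leaves characters it considers safe unquoted, so only the space-containing value is quoted):

    >>> with_prefix_from_env({"FOO": "a b"}, "python main.py")
    "set -a; FOO='a b'; python main.py"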
    

    Workaround: Remote Sourcing

    If your environment variables are sensitive and you don't want them to be logged with the command, you can transfer them out-of-band and source the remote file containing them.

    from pipes import quote
    
    def with_env_from_remote_file(filename, command):
        return "set -a; . %s; %s" % (quote(filename), command)
    
    SSHExecuteOperator(bash_command=with_env_from_remote_file(envfile, "python main.py"),
                       ssh_hook=ssh_hook, task_id="run", dag=dag)
    

    Note that set -a directs the shell to export all defined variables, so the sourced file need only contain key=val assignments; they'll be exported automatically. If generating this file from your Python script, be sure to quote both keys and values with pipes.quote() to ensure the file performs only assignments and doesn't run other commands. The . keyword is the POSIX-compliant equivalent of the bash source command.
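
    If you do generate the file from Python, a minimal sketch might look like this (write_env_file and the app.env filename are hypothetical; you'd still need to transfer the file to the remote host out-of-band, e.g. with scp, before the task runs):

    from pipes import quote  # in Python 3, "from shlex import quote"

    def write_env_file(env_dict, filename):
        # Plain key=val lines; quoting both sides keeps the file free of
        # anything the shell could execute as a command when sourced.
        with open(filename, "w") as f:
            for (k, v) in env_dict.items():
                f.write("%s=%s\n" % (quote(k), quote(v)))

    write_env_file(temp_env, "app.env")  # then e.g.: scp app.env remote:app.env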