Tags: proxy, airflow, celery, airflow-2.x

Setting proxy environment variables when running a DAG in Apache Airflow


I need to run Apache Airflow on a corporate network. To reach the internet from any machine on it, I have to set the "http_proxy", "https_proxy" and "no_proxy" environment variables.

Right now, the VM I'm using to run Airflow defines these environment variables in /etc/profile.

Python scripts that make HTTP requests to external websites run fine from the terminal, but the same scripts fail inside a DAG because they cannot resolve or reach the address.

It seems that Airflow runs scripts in an isolated environment. I am currently using CeleryExecutor.

First, I printed all the environment variables from inside a task with print(environ). I got this:

environ({'LANG': 'en_US.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/airflow', 'LOGNAME': 'airflow', 'USER': 'airflow', 'SHELL': '/bin/bash', 'INVOCATION_ID': '5c777ce3b07748309b972d877a0545ea', 'JOURNAL_STREAM': '9:37430', 'AIRFLOW_CONFIG': '/opt/airflow/airflow.cfg', 'AIRFLOW_HOME': '/opt/airflow', '_MP_FORK_LOGLEVEL_': '20', '_MP_FORK_LOGFILE_': '', '_MP_FORK_LOGFORMAT_': '[%(asctime)s: %(levelname)s/%(processName)s] %(message)s', 'CELERY_LOG_LEVEL': '20', 'CELERY_LOG_FILE': '', 'CELERY_LOG_REDIRECT': '1', 'CELERY_LOG_REDIRECT_LEVEL': 'WARNING', 'AIRFLOW_CTX_DAG_OWNER': 'airflow', 'AIRFLOW_CTX_DAG_ID': 'primeiro-teste', 'AIRFLOW_CTX_TASK_ID': 'extract', 'AIRFLOW_CTX_EXECUTION_DATE': '2022-12-13T16:18:17.185417+00:00', 'AIRFLOW_CTX_DAG_RUN_ID': 'manual__2022-12-13T16:18:17.185417+00:00'})

There are no proxy variables, so the script cannot reach anything outside the network.
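Rather than scanning the whole environ dump, a task can ask the standard library which proxy settings it sees. The sketch below relies on `urllib.request.getproxies_environment()`, which collects variables ending in `_proxy` (in either case) from the process environment. The helper name `report_proxy_env` is made up for illustration.

```python
import urllib.request


def report_proxy_env():
    """Return the proxy settings visible to this process, keyed by scheme."""
    # getproxies_environment() scans os.environ for names ending in
    # "_proxy" (case-insensitive), e.g. http_proxy, HTTPS_PROXY, no_proxy.
    return urllib.request.getproxies_environment()


if __name__ == "__main__":
    proxies = report_proxy_env()
    if proxies:
        for scheme, value in sorted(proxies.items()):
            print(f"{scheme}: {value}")
    else:
        print("no proxy variables visible to this process")
```

Run from a terminal and again from inside a task, this should make the difference between the two environments obvious.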

I also checked from within a DAG which DNS servers were in use, to see if they were correct. They were.

The only way I got the script to work was by defining these environment variables before making an HTTP request:

os.environ['HTTP_PROXY'] = os.environ['http_proxy'] = os.environ['HTTPS_PROXY'] = os.environ['https_proxy'] = "PROXY STRING"
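The one-liner above leaves out no_proxy. As a stopgap until the variables are defined where Airflow is started, a small helper called at the top of a task could set every spelling in one place. This is only a sketch: `apply_proxy_env` is a hypothetical name, and the proxy URL and bypass list are placeholders.

```python
import os


def apply_proxy_env(proxy_url, no_proxy=""):
    """Set the http/https proxy variables in both lower and upper case."""
    for name in ("http_proxy", "https_proxy"):
        # Libraries differ on which case they read, so set both.
        os.environ[name] = proxy_url
        os.environ[name.upper()] = proxy_url
    if no_proxy:
        os.environ["no_proxy"] = no_proxy
        os.environ["NO_PROXY"] = no_proxy


if __name__ == "__main__":
    # Placeholder values; substitute the real proxy string.
    apply_proxy_env("http://proxy.example:80", no_proxy="localhost,127.0.0.1")
    print({k: v for k, v in os.environ.items() if k.lower().endswith("_proxy")})
```

This still has to be repeated in every task, which is exactly the duplication I'd like to avoid.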

I was hoping to find a way to define these variables for all DAGs, but when I set them the way Tomasz did, they don't seem to be usable unless they start with the "AIRFLOW" prefix.


Solution

  • Creating an environment file and putting it somewhere is not sufficient. You have to tell Airflow where that file is when it starts, whichever way you start it (e.g. systemd).

    Airflow only sees the environment it is started with, so whatever starts Airflow has to load the environment file you created. /etc/profile is read by login shells, not by systemd services, which is why the variables exist in your terminal but not in a task. When you run Airflow under systemd, you specify the file with EnvironmentFile in the [Service] section of the unit file. Environment variables not defined in that file will not be picked up by Airflow. With CeleryExecutor the tasks run inside the worker process, so the worker's unit file needs the same EnvironmentFile line. Your unit files may look different from mine, but here is my webserver unit as an example:

    [Unit]
    Description=Airflow webserver daemon
    After=network.target mysqld.service rabbitmq-server.service
    Wants=mysqld.service rabbitmq-server.service
    
    [Service]
    EnvironmentFile=/prod/airflow/airflow.env
    User=airflow
    Group=airflow
    Type=simple
    ExecStart=/usr/bin/bash -c "source /prod/airflow/airflow_38_venv/bin/activate ; /prod/airflow/airflow_38_venv/bin/airflow webserver -p 7635 --pid /prod/airflow/run/webserver.pid"
    Restart=on-failure
    RestartSec=5s
    PrivateTmp=true
    
    [Install]
    WantedBy=multi-user.target
    

    EnvironmentFile can point to any location and filename that the user running Airflow has read access to. The suggested location is /etc/sysconfig/airflow, but as you can see mine is different.

    Here is what the body of my EnvironmentFile looks like, edited to remove specific details. Again, yours will probably look different.

    $ cat /prod/airflow/airflow.env
    # This file is the environment file for Airflow. Put this file in /etc/sysconfig/airflow per default
    # configuration of the systemd unit files.
    #
    AIRFLOW_CONFIG=/prod/airflow/airflow.cfg
    AIRFLOW_HOME=/prod/airflow
    http_proxy=http://something.proxyserver.com:80
    https_proxy=http://something.proxyserver.com:80
    no_proxy=*.google.com,127.0.0.1
    HTTP_PROXY=http://something.proxyserver.com:80
    HTTPS_PROXY=http://something.proxyserver.com:80
    NO_PROXY=*.google.com,127.0.0.1
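Before restarting the services, it can help to confirm that the environment file defines every proxy variable in both spellings. The following is a rough sketch and not part of Airflow or systemd: `parse_env_text` and `missing_proxy_vars` are hypothetical helpers, and the parsing only handles the simple `KEY=value` lines and `#` or `;` comments shown above, not systemd's full quoting rules.

```python
REQUIRED = ("http_proxy", "https_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_proxy_vars(text):
    """Return the required proxy variable names that are absent or empty."""
    env = parse_env_text(text)
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    sample = """\
# example environment file
AIRFLOW_HOME=/prod/airflow
http_proxy=http://something.proxyserver.com:80
HTTP_PROXY=http://something.proxyserver.com:80
"""
    print(missing_proxy_vars(sample))
```

To check a real file, pass its contents instead of the sample, for example the text read from /prod/airflow/airflow.env. An empty list means all six names are present.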