I want to use the Python library rapidjson in my Airflow DAG. My code repo is hosted on Git. Whenever I merge something into the master or test branch, the changes are automatically reflected in the Airflow UI.
My Airflow is hosted as a VM on AWS EC2. Under the EC2 instances, I see three different instances for: scheduler, webserver, workers.
I connected to these 3 individually via Session Manager. Once the terminal opened, I installed the library using
pip install python-rapidjson
I also verified the installation using pip list. Now, I import the library in my DAG's code simply like this:
import rapidjson
However, when I open the Airflow UI, my DAG shows this error:
No module named 'rapidjson'
Are there additional steps that I am missing out on? Do I need to import it into my Airflow code base in any other way as well?
Within my Airflow Git repository, I also have a "requirements.txt" file. I tried adding this line to it as well:
python-rapidjson==1.5.5
but I do not know how to actually install it.
I tried this:
pip install requirements.txt
within the session manager's terminal as well. However, the terminal is not able to locate this file. In fact, when I do "ls", I don't see anything.
pwd
/var/snap/amazon-ssm-agent/6522
Have you tried using the PythonVirtualenvOperator?
It will allow you to install the library at runtime so you don't need to make changes on the server just for one job.
To run a function called my_callable, simply use the following:
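from airflow.operators.python import PythonVirtualenvOperator

my_task = PythonVirtualenvOperator(
    task_id="my_task",  # no trailing space: task IDs allow only letters, digits, dashes, dots and underscores
    requirements=["python-rapidjson==1.5.5"],  # a list of pip requirement strings
    python_callable=my_callable,
)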
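One detail that is easy to miss: the operator runs the callable in a separate virtualenv, so any imports have to happen inside the function body rather than at the top of the DAG file. A minimal sketch of what my_callable could look like follows; the payload is purely illustrative and not taken from your DAG.

```python
def my_callable():
    # Import inside the function: the callable is executed in a fresh
    # virtualenv, so module-level imports from the DAG file are not available.
    import rapidjson

    # Illustrative payload only (an assumption, not from the original DAG).
    payload = rapidjson.dumps({"status": "ok"})
    print(payload)
    return payload
```

This also means the top-level import rapidjson in the DAG file can be removed, which is what the scheduler and webserver were failing on.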
I still recommend updating your server environment for core libs, but this is a best practice when using special libs for a small minority of jobs.
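As for why the original pip install did not help: one possible cause (an assumption, since the post does not show how Airflow was installed) is that pip in the Session Manager shell belongs to a different Python interpreter or user than the one Airflow runs under. A small sketch to check this is below; describe_environment is a hypothetical helper name. Run it once in the Session Manager terminal and once from inside an Airflow task, then compare the output.

```python
import importlib.util
import sys


def describe_environment(module_name="rapidjson"):
    """Report which interpreter is running and whether it can see the module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,  # path of the running Python interpreter
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    print(describe_environment())
```

If the two executable paths differ, installing with that exact interpreter (for example, its own python -m pip) on every scheduler, webserver and worker instance would be the thing to try.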