I want to use the Python library rapidjson in my Airflow DAG. My code repo is hosted on Git, and whenever I merge something into the master or test branch, the changes are automatically reflected in the Airflow UI.
My Airflow is hosted on AWS EC2. Under the EC2 instances, I see three separate instances: scheduler, webserver, and workers.
I connected to these 3 individually via Session Manager. Once the terminal opened, I installed the library using
pip install python-rapidjson
I also verified the installation using pip list. Now, I import the library in my DAG's code simply like this:
import rapidjson
However, when I open the Airflow UI, my DAG shows this error:
No module named 'rapidjson'
Are there additional steps that I am missing out on? Do I need to import it into my Airflow code base in any other way as well?
Within my Airflow git repository, I also have a "requirements.txt" file. I tried adding the library there as well, but I do not know how to actually install from that file.
I tried this:
pip install requirements.txt
within the session manager's terminal as well. However, the terminal is not able to locate this file. In fact, when I do "ls", I don't see anything.
Have you tried using the PythonVirtualenvOperator?
It will allow you to install the library at runtime so you don't need to make changes on the server just for one job.
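One detail worth knowing: the operator runs the function in a separate virtual environment, so the import has to happen inside the function rather than at the top of the DAG file. A minimal sketch of such a callable follows; the payload is a made-up example, not something from your DAG.

```python
def my_callable():
    # Import inside the function: it executes in the task's virtualenv,
    # where python-rapidjson is installed, not in the DAG-parsing process.
    import rapidjson

    # Hypothetical payload, used only to show the library being called.
    payload = {"source": "airflow", "ok": True}
    return rapidjson.dumps(payload)
```

Because the import is local to the function, the scheduler can parse the DAG file even if rapidjson is not installed on the server.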
To run a function called my_callable, simply use the following:
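from airflow.operators.python import PythonVirtualenvOperator
my_task = PythonVirtualenvOperator(
    task_id="my_task",
    python_callable=my_callable,  # the function to run in the virtualenv
    requirements=["python-rapidjson"],  # installed when the task runs
)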
I still recommend updating your server environment for core libs, but this is a best practice when using special libs for a small minority of jobs.
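If you do go the server route, it may help to first check which interpreter Airflow actually runs under. A common cause of this error is that pip installed the package into a different Python environment than the one Airflow uses; that is only a guess here, since nothing in the question confirms it. The sketch below uses only the standard library, and describe_module is a hypothetical helper name. Run it from a task, or with the same Python that launches Airflow, and compare the paths with what pip reports.

```python
import importlib.util
import sys


def describe_module(name):
    """Report the running interpreter and where `name` would be imported from."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        # None means this interpreter cannot see the module at all.
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    print(describe_module("rapidjson"))
```

If the location comes back as None, the package is not visible to that interpreter, and it needs to be installed with that interpreter's own pip on each instance that parses or runs the DAG.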