python, pandas, etl, airflow, airflow-scheduler

How to use Airflow to orchestrate simple pandas ETL Python scripts?


I love the idea of Airflow, but I'm stuck on the basics. Since yesterday I have Airflow running on a VM with an Ubuntu/Postgres setup. I can see the dashboard and the example data :)) What I want now is to migrate an example script that I use to turn raw data into prepared data.

Imagine you have a folder of CSV files. Today my script iterates through it, reading each file into a list that then gets concatenated into a DataFrame. After that I tidy up the column names, do some data cleaning, and write the result out in a different format. The steps are roughly these (a rough sketch follows the list below):

1: pd.read_csv for files in directory

2: create a df

3: clean column names

4: clean values (in parallel with step 3)

5: write the result to a database
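
Roughly, the current script looks like this sketch (the paths and the target connection string are just placeholders, not my real setup):

from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

SOURCE_DIR = Path("/data/raw")                      # placeholder folder of CSV files
TARGET_URI = "postgresql://user:pw@localhost/dwh"   # placeholder target database

def run_etl():
    # 1/2: read every CSV in the directory and concatenate into one DataFrame
    frames = [pd.read_csv(path) for path in SOURCE_DIR.glob("*.csv")]
    df = pd.concat(frames, ignore_index=True)

    # 3: clean column names
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # 4: clean values
    df = df.drop_duplicates().dropna(how="all")

    # 5: write the result to a database table
    engine = create_engine(TARGET_URI)
    df.to_sql("prepared_data", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    run_etl()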

How would I have to organize my files for Airflow? What should the script look like? Do I pass a single method, a single file, or do I have to create several files for each part? I'm missing the basic concept at this point :( Everything I read about Airflow is far more complex than my simple case. I was also considering stepping away from Airflow towards Bonobo, Mara, or Luigi, but I think Airflow is worth it?!


Solution

  • I'd use the PythonOperator, put the whole code into a Python function, create one Airflow task and that's it.
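
    A minimal sketch of this one-task variant (Airflow 2.x imports assumed; the dag_id, the schedule and the my_etl module holding your pandas function are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from my_etl import run_etl   # hypothetical module containing the plain pandas ETL function

    with DAG(
        dag_id="simple_pandas_etl",        # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",        # or whatever cadence you need
        catchup=False,
    ) as dag:
        etl = PythonOperator(
            task_id="run_etl",
            python_callable=run_etl,       # the whole ETL runs inside this one task
        )

    Save this file into your dags/ folder and the scheduler will pick it up.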

    It would also be possible to put the loading of the CSV files into one function and the database writing into another, if it is necessary to split those steps. All of this would go into one single DAG.

    So your one DAG would have three tasks like:

    loadCSV (PythonOperator)
    parseDF (PythonOperator)
    pushToDB (PythonOperator)
    

    If you use several tasks, you need to pass data between them with Airflow's XCom. In the beginning it is easier to just use one task.
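
    A sketch of the three-task variant (again Airflow 2.x assumed; rather than pushing whole DataFrames through XCom, which stores them in the metadata database, each task passes a file path to the next one; directories, URIs and table names are placeholders):

    from datetime import datetime
    from pathlib import Path

    import pandas as pd
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from sqlalchemy import create_engine

    SOURCE_DIR = Path("/data/raw")                      # placeholder source folder
    TMP_DIR = Path("/tmp/etl")                          # placeholder scratch directory
    TARGET_URI = "postgresql://user:pw@localhost/dwh"   # placeholder target database

    def load_csv():
        frames = [pd.read_csv(p) for p in SOURCE_DIR.glob("*.csv")]
        df = pd.concat(frames, ignore_index=True)
        TMP_DIR.mkdir(parents=True, exist_ok=True)
        out = TMP_DIR / "raw.csv"
        df.to_csv(out, index=False)
        return str(out)                                 # return value is pushed to XCom automatically

    def parse_df(ti):
        raw_path = ti.xcom_pull(task_ids="loadCSV")     # pull the path, not the DataFrame
        df = pd.read_csv(raw_path)
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        df = df.drop_duplicates().dropna(how="all")
        out = TMP_DIR / "prepared.csv"
        df.to_csv(out, index=False)
        return str(out)

    def push_to_db(ti):
        prepared_path = ti.xcom_pull(task_ids="parseDF")
        df = pd.read_csv(prepared_path)
        engine = create_engine(TARGET_URI)
        df.to_sql("prepared_data", engine, if_exists="replace", index=False)

    with DAG(
        dag_id="pandas_etl_three_tasks",    # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load = PythonOperator(task_id="loadCSV", python_callable=load_csv)
        parse = PythonOperator(task_id="parseDF", python_callable=parse_df)
        push = PythonOperator(task_id="pushToDB", python_callable=push_to_db)

        load >> parse >> push

    Note that passing file paths like this assumes all tasks run on the same machine; on a distributed setup you would point them at shared storage instead.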

    There are several code examples here under the airflow tag. Once you have created something, ask again.