Tags: python, google-cloud-platform, airflow, google-cloud-composer

Cloud Composer writes file that disappears after DAG execution


I was trying to write a .txt file to the DAGs folder from a Cloud Composer DAG. The file never showed up, and I thought there was something wrong with my code, so I tried saving a pandas dataframe in .xlsx format to the DAGs folder and loading it back.

It turns out that worked: in the same DAG run I was able to write the pandas dataframe and then read it back, but when I looked in the folder afterwards there was no file. If I run the code again and try to read it, it says the file doesn't exist.

It's like the file gets written only temporarily.

I'm also using the folder's full path ("/home/airflow/gcs/dags"), and since I'm saving the file to the DAGs folder of my own Composer environment, I didn't expect this much trouble.

Does anyone have any thoughts on how I can solve this?

EDIT:

Snippet of code:

```python
import os
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def _crawl_spiders():
    # set the working dir to the DAGs folder
    os.chdir('/home/airflow/gcs/dags/mypath')

    df = pd.read_excel('./x-path/sheet.xlsx')

    # write the file to the DAGs folder, then read it back in the same run
    df.to_excel('/home/airflow/gcs/dags/mypath/test.xlsx', index=False)
    b = pd.read_excel('/home/airflow/gcs/dags/mypath/test.xlsx')
    print(f'Success, b columns: {b.columns}')


with DAG(dag_id="crawler", start_date=datetime(2022, 7, 28),
         schedule_interval='@daily', tags=['muffet', 'crawler']) as dag:

    crawl_spiders = PythonOperator(
        task_id='crawl_spiders',
        python_callable=_crawl_spiders)
```

Solution

  • You may have better luck saving your *.xlsx file to the /home/airflow/gcs/data directory rather than /home/airflow/gcs/dags (a sketch of this is shown after the doc excerpts below).

    (Or save the dataframe to a local *.xlsx file and use the Google Cloud Storage API client library to upload it to a GCS path; see the second sketch below.)

    The docs indicate that the Cloud Composer environment is set up to sync the /home/airflow/gcs/dags contents with its corresponding GCS path "unidirectionally", i.e. from GCS to the local directories only, not in the other direction: "Unidirectional synching means that local changes in these folders are overwritten." This is why the saved *.xlsx file is momentarily available but disappears later.

    Below are the relevant excerpts from the docs.


    From the Folders in the Cloud Storage Bucket section of the Data Stored in Cloud Storage page:

    Cloud Composer stores the source code for your workflows (DAGs) and their dependencies in specific folders in Cloud Storage and uses Cloud Storage FUSE to map the folders to the Airflow instances in your Cloud Composer environment.

    The local directories mapped to GCS paths are as follows. In particular, the /data directory "stores the data that tasks produce and use. This folder is mounted on all worker nodes."

    • /home/airflow/gcs/dags
    • /home/airflow/gcs/plugins
    • /home/airflow/gcs/data
    • /home/airflow/gcs/logs

    From the Data Synchronization section of the same page:

    Cloud Composer synchronizes the dags/ and plugins/ folders uni-directionally by copying locally. Unidirectional synching means that local changes in these folders are overwritten.

    The data/ and logs/ folders synchronize bi-directionally by using Cloud Storage FUSE.
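
For the first approach, here's a minimal sketch of the task function from the question, rewritten to write its output under the data/ directory instead of dags/. The subfolder and file names (mypath, x-path/sheet.xlsx, test.xlsx) are carried over from the question and assumed to exist:

```python
import os

import pandas as pd

# data/ syncs bi-directionally with the environment's bucket,
# so files written here persist after the task finishes.
DATA_DIR = '/home/airflow/gcs/data'


def _crawl_spiders():
    # read the source sheet from the DAGs folder (path assumed from the question)
    df = pd.read_excel('/home/airflow/gcs/dags/mypath/x-path/sheet.xlsx')

    # write the output under data/ instead of dags/
    out_path = os.path.join(DATA_DIR, 'test.xlsx')
    df.to_excel(out_path, index=False)

    # reading it back still works, and the file also survives the DAG run
    b = pd.read_excel(out_path)
    print(f'Success, b columns: {b.columns}')
```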
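For the second approach, a minimal sketch using the google-cloud-storage client library. The bucket name and object path below are placeholders; on a Composer worker the client should pick up the environment's default service-account credentials:

```python
from google.cloud import storage


def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a local file to gs://<bucket_name>/<blob_name>."""
    client = storage.Client()  # uses the worker's default credentials
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)


# Inside the task: write to a scratch location first, then upload.
# (bucket and object names here are hypothetical)
# df.to_excel('/tmp/test.xlsx', index=False)
# upload_to_gcs('/tmp/test.xlsx', 'my-composer-bucket', 'data/test.xlsx')
```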