I'm using Airflow for some ETL work, and in some stages I would like to use temporary tables (mostly to keep the code and data objects self-contained and to avoid creating a lot of metadata tables).
Using the Postgres connection in Airflow with the PostgresOperator, the behaviour I found is that each execution of a PostgresOperator opens a new connection (or session, if you prefer) to the database. In other words, all temporary objects created by the previous task of the DAG are lost.
To reproduce it with a simple example, I use this code (you don't need to run it, just look at the objects):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
default_args = {
'owner': 'airflow'
,'depends_on_past': False
,'start_date': datetime(2018, 6, 13)
,'retries': 3
,'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'refresh_views'
, default_args=default_args)
# Create database workflow
drop_exist_temporary_view = "DROP TABLE IF EXISTS temporary_table_to_be_used;"
create_temporary_view = """
CREATE TEMPORARY TABLE temporary_table_to_be_used AS
SELECT relname AS views
,CASE WHEN relispopulated = 'true' THEN 1 ELSE 0 END AS relispopulated
,CAST(reltuples AS INT) AS reltuples
FROM pg_class
WHERE relname = 'some_view'
ORDER BY reltuples ASC;"""
use_temporary_view = """
DO $$
DECLARE
is_materialized integer := (SELECT relispopulated FROM temporary_table_to_be_used WHERE views LIKE '%<<some_name>>%');
view_to_refresh text := 'some_view';
start_time timestamp;
BEGIN
start_time := clock_timestamp();
IF is_materialized = 0 THEN
EXECUTE 'REFRESH MATERIALIZED VIEW ' || view_to_refresh || ' WITH DATA;';
ELSE
EXECUTE 'REFRESH MATERIALIZED VIEW CONCURRENTLY ' || view_to_refresh || ' WITH DATA;';
END IF;
END;
$$ LANGUAGE plpgsql;
"""
# Objects to be executed
drop_exist_temporary_view = PostgresOperator(
task_id='drop_exist_temporary_view',
sql=drop_exist_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
create_temporary_view = PostgresOperator(
task_id='create_temporary_view',
sql=create_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
use_temporary_view = PostgresOperator(
task_id='use_temporary_view',
sql=use_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
# Data workflow
drop_exist_temporary_view >> create_temporary_view >> use_temporary_view
At the end of execution, I receive the following message:
[2018-06-14 15:26:44,807] {base_task_runner.py:95} INFO - Subtask: psycopg2.ProgrammingError: relation "temporary_table_to_be_used" does not exist
Does anyone know whether Airflow has some way to retain the same connection to the database? I think it could save a lot of work in creating and maintaining several objects in the database.
You can retain the connection by building a custom operator that leverages the PostgresHook, keeping a single connection open while you perform a set of SQL operations.
You may find some examples in contrib on incubator-airflow or in Airflow-Plugins.
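A minimal sketch of such an operator, assuming the 1.x import paths used in the question (the class name MultiStatementPostgresOperator and its arguments are illustrative, not part of Airflow; the key point is that PostgresHook.run() accepts a list of statements and executes them all on one connection):
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class MultiStatementPostgresOperator(BaseOperator):
    """Runs a list of SQL statements on a single connection,
    so session-scoped objects (e.g. temporary tables) survive
    from one statement to the next."""
    template_fields = ('sql',)

    @apply_defaults
    def __init__(self, sql, postgres_conn_id='postgres_default',
                 autocommit=True, *args, **kwargs):
        super(MultiStatementPostgresOperator, self).__init__(*args, **kwargs)
        self.sql = sql  # a list of SQL strings
        self.postgres_conn_id = postgres_conn_id
        self.autocommit = autocommit

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        # hook.run() opens one connection and executes every statement
        # on it before closing, so the temporary table created by one
        # statement is still visible to the next.
        hook.run(self.sql, autocommit=self.autocommit)
With that, the three tasks in the question collapse into a single task that receives the three SQL strings defined at the top of the DAG file:
refresh_views = MultiStatementPostgresOperator(
    task_id='refresh_views',
    sql=[drop_exist_temporary_view, create_temporary_view, use_temporary_view],
    postgres_conn_id='dwh_staging',
    dag=dag)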
Another option is to persist this temporary data to XComs. That lets you keep the metadata alongside the task in which it was created, which may help with troubleshooting down the road.
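A minimal sketch of the XCom approach, reusing the 'dwh_staging' connection (the function and task names are illustrative; since XCom values live in Airflow's metadata database, this suits small lookup values rather than bulk data):
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def fetch_view_state(**context):
    hook = PostgresHook(postgres_conn_id='dwh_staging')
    # The callable's return value is pushed to XCom automatically
    # under the key 'return_value'.
    return hook.get_first(
        "SELECT relname, relispopulated FROM pg_class WHERE relname = 'some_view';")

def refresh_if_needed(**context):
    relname, relispopulated = context['ti'].xcom_pull(task_ids='fetch_view_state')
    hook = PostgresHook(postgres_conn_id='dwh_staging')
    if relispopulated:
        # CONCURRENTLY requires a unique index on the materialized view.
        hook.run('REFRESH MATERIALIZED VIEW CONCURRENTLY %s WITH DATA;' % relname)
    else:
        hook.run('REFRESH MATERIALIZED VIEW %s WITH DATA;' % relname)

fetch_view_state_task = PythonOperator(
    task_id='fetch_view_state',
    python_callable=fetch_view_state,
    provide_context=True,
    dag=dag)

refresh_if_needed_task = PythonOperator(
    task_id='refresh_if_needed',
    python_callable=refresh_if_needed,
    provide_context=True,
    dag=dag)

fetch_view_state_task >> refresh_if_needed_task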