Search code examples
duckdb

Dataframe not accessable from another context if captured in a VIEW


I have following code and I would like to initialize the delta table and capture the instance of the pyarrow dataset so that I can materialize with a different filter later. This should then pushdown and prune the loaded data. Understand is that df is not in the scope but is there a workaround that i can save it and use in a later point?

The context where i want to use it is dbt-duckdb adapter, the problem is that initialization of the source view/table and node which is using that view is in different scopes

import duckdb
from deltalake import DeltaTable
conn = duckdb.connect("./data.duckdb")
def create_view(conn):
    delta_path = "../utils/sf1/delta_partition/customer"
    dt = DeltaTable(delta_path)
    dataset = dt.to_pyarrow_dataset()
    df = duckdb.arrow(dataset)
    conn.sql("CREATE OR REPLACE VIEW test AS SELECT * from df")
    return conn
    ##df is here droped

conn = create_view(conn)

# df not accessable
print(conn.sql("SELECT * FROM test WHERE c_mktsegment = 'BUILDING' and c_nationkey=15").explain())

#conn.sql("SELECT * FROM test_materialized").show()

conn.close()

This code throws:

     10     return conn
     12 conn = create_view(conn)
---> 14 print(conn.sql("SELECT * FROM test WHERE c_mktsegment = 'BUILDING' and c_nationkey=15").explain())
     16 #conn.sql("SELECT * FROM test_materialized").show()
     18 conn.close()

CatalogException: Catalog Error: Table with name df does not exist!
Did you mean "test"?

I tried locally everything in the same context and the pushdown predicates are propagated over view to the deltatable


Solution

  • I think you just want to register the dataset and refer to that instead of making a view:

    conn.register("delta_dataset", dataset)
    conn.sql("select * from delta_dataset")
    

    I am afraid the dataframe might go out of scope in your example