I have following code and I would like to initialize the delta table and capture the instance of the pyarrow dataset so that I can materialize with a different filter later. This should then pushdown and prune the loaded data. Understand is that df is not in the scope but is there a workaround that i can save it and use in a later point?
The context where i want to use it is dbt-duckdb adapter, the problem is that initialization of the source view/table and node which is using that view is in different scopes
import duckdb
from deltalake import DeltaTable
conn = duckdb.connect("./data.duckdb")
def create_view(conn):
delta_path = "../utils/sf1/delta_partition/customer"
dt = DeltaTable(delta_path)
dataset = dt.to_pyarrow_dataset()
df = duckdb.arrow(dataset)
conn.sql("CREATE OR REPLACE VIEW test AS SELECT * from df")
return conn
##df is here droped
conn = create_view(conn)
# df not accessable
print(conn.sql("SELECT * FROM test WHERE c_mktsegment = 'BUILDING' and c_nationkey=15").explain())
#conn.sql("SELECT * FROM test_materialized").show()
conn.close()
This code throws:
10 return conn
12 conn = create_view(conn)
---> 14 print(conn.sql("SELECT * FROM test WHERE c_mktsegment = 'BUILDING' and c_nationkey=15").explain())
16 #conn.sql("SELECT * FROM test_materialized").show()
18 conn.close()
CatalogException: Catalog Error: Table with name df does not exist!
Did you mean "test"?
I tried locally everything in the same context and the pushdown predicates are propagated over view to the deltatable
I think you just want to register
the dataset and refer to that instead of making a view:
conn.register("delta_dataset", dataset)
conn.sql("select * from delta_dataset")
I am afraid the dataframe might go out of scope in your example