I have a DLT table defined in my DLT notebook that should run exactly once. However, it always runs two or more times, which causes errors when I define other tables. Why? Is DLT parallelizing my function, and is that why it is called multiple times? The definition is as simple as this:
@dlt.table(
    comment="Silver table for silver_checks",
    name='silver_checks'
)
def table_checks():
    first_df_checks = True
    print('hello world')
    return dlt.readStream('computer_vision_dlt.db/dim_logmap')
Standard Output:
To connect another client to this kernel, use:
--existing /databricks/kernel-connections/ffe608650a3809cf191a3bc79503829585c8aa331d0ed24fde1b65d99f4835e9.json
2022-11-15T14:49:58.432+0000: [GC (Allocation Failure) [PSYoungGen: 2207232K->23793K(2436608K)] 3093469K->910030K(7795712K), 0.0148085 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
hello world
hello world
I expect it to run only once, or equivalently, to print "hello world" only once.
This is how DLT works: when the pipeline starts, all functions with DLT annotations attached are evaluated without data to build the execution graph, infer the schemas, perform schema checks, and so on. Actual processing starts only after the execution graph is built and validated.
Because DLT is designed to be declarative, you need to avoid any side effects inside annotated functions, as those functions could be called multiple times even without initialization, for example, for a partial refresh...
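To make the repeated evaluation concrete, here is a minimal sketch in plain Python. It does not use the real `dlt` module: the `table` decorator, the `registry` dict, and `run_pipeline` are hypothetical stand-ins, and the assumption that the runner calls each function exactly twice is only for illustration. It shows why a side effect such as a `print` inside a registered function can fire more than once.

```python
# Toy stand-in for a declarative pipeline runner. This is NOT the real dlt
# module; every name here is hypothetical and exists only for illustration.
registry = {}   # table name -> defining function
call_log = []   # records each evaluation, standing in for print()


def table(name):
    """Hypothetical decorator: registers the function instead of running it."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@table(name="silver_checks")
def table_checks():
    # Side effect: happens on every evaluation of the definition.
    call_log.append("hello world")
    # Placeholder for the DataFrame definition a real table would return.
    return "definition of silver_checks"


def run_pipeline():
    # Pass 1 (assumed): evaluate every definition to build and validate a graph.
    graph = {name: fn() for name, fn in registry.items()}
    # Pass 2 (assumed): evaluate the definitions again for actual processing.
    results = {name: fn() for name, fn in registry.items()}
    return graph, results


graph, results = run_pipeline()
print(len(call_log))  # number of times table_checks was evaluated
```

The returned definition is identical on both passes, so the function stays safe to call repeatedly. Only the side effect multiplies, which is why anything that must happen once belongs outside the annotated function.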