Consider the following scenario on Azure Databricks.
spark.table("table1").createOrReplaceTempView("test_view")
df = test_view.drop("col1")
spark.sql("DROP TABLE table1")
df1 =df.createOrReplaceTempView("test_view2")
df1.display()
In the code above, the only action is in the last line. How does `df` show output even though the underlying table is dropped before the action, given that Spark works on the lazy evaluation concept? I was expecting to see an error saying the table cannot be resolved, but to my surprise the display shows output.
After reading your comments I think I see where the misunderstanding is - let's tackle that.
You are right that Spark follows lazy evaluation - which just means that it will not "materialize" the data until an action is called. Basically, the data comes into memory for processing only after an action is triggered.
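For example, here is a minimal sketch (assuming a SparkSession `spark`, as on Databricks, and an existing table `table1`):

```python
df = spark.table("table1")   # transformation: builds a logical plan, reads nothing
df2 = df.drop("col1")        # transformation: the plan grows, still no I/O
df2.show()                   # action: only now does Spark read and process the data
```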
Now coming to your code:
spark.table("table1").createOrReplaceTempView("test_view")
: This will create (or replace) a temporary view (not materialized) of the table table1
which you can query using spark SQL.df = test_view.drop("col1")
: This is a wrong statement. Temporary views are only accessible to be queried using spark.sql()
and you are treating test_view
as a dataframe which is not allowed. You can replace the first statement with test_view = spark.table("table1")
to make this legal. Now, this is a transformation. So, the dataset is still not materialized.spark.sql("DROP TABLE table1")
: This drops the table table1
from the spark catalog.df1 = df.createOrReplaceTempView("test_view2")
: This is a legal statement but df1
is not a df. It's a NoneType
Object because createOrReplaceTempView()
doesn't return a dataframe. Replace it with df1 = df
which is also kind of unnecessary but to keep up with your code I am going with this.df1.display()
: Finally, some action time. Here's what's going to happen now:
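With those two fixes applied, a corrected version of your snippet would look like this (a sketch, still assuming `table1` exists):

```python
test_view = spark.table("table1")   # DataFrame over table1 (transformation)
df = test_view.drop("col1")         # transformation: nothing materialized yet
spark.sql("DROP TABLE table1")      # removes table1 from the Spark catalog
df1 = df                            # plain assignment; df1 is a real DataFrame
df1.display()                       # Databricks notebooks; use df1.show() in plain PySpark
```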
Here's what's going to happen now when the corrected code runs:

`test_view = spark.table("table1")`: `test_view` becomes a DataFrame over `table1`. At this point Spark resolves `table1` against the catalog, so the DataFrame's plan points directly at the table's underlying data source. No data is loaded yet.

`df = test_view.drop("col1")`: `drop()` returns a new DataFrame whose plan is `test_view`'s plan plus a projection that removes `col1`. That plan is what `df` holds now.

`spark.sql("DROP TABLE table1")`: This drops `table1` from the Spark catalog. NOTE: now to the real doubt you have - if the table is dropped, why does the data still show? Remember that `df`'s plan was already resolved when the table was read, so it references the table's underlying data directly and no longer needs the catalog entry for `table1`. As long as the underlying files are still readable (for example with external tables, or when the platform defers the actual file deletion), that plan stays valid.

`df1 = df` -> `df1.display()`: `df1` is just another reference to the same plan, so the action executes that already-resolved plan and shows the output without throwing any error: it has no direct dependence on the catalog entry for `table1` anymore. That is why you can still see the data even after `table1` is dropped - you can convince yourself with the sketch below.
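If you want to see the split between the resolved plan and the catalog entry directly, here is a minimal, self-contained sketch (`table1` is just a throwaway table created for the experiment; whether the final `show()` succeeds depends on whether your environment physically deletes the underlying files on drop):

```python
from pyspark.sql.utils import AnalysisException

spark.sql("CREATE TABLE table1 AS SELECT 1 AS col1, 2 AS col2")

df = spark.table("table1").drop("col1")   # plan resolved against table1 here
spark.sql("DROP TABLE table1")            # removes only the catalog entry

try:
    spark.table("table1")                 # needs the catalog -> table not found
except AnalysisException as e:
    print("Catalog lookup fails:", e)

df.show()                                 # the already-resolved plan still runs,
                                          # as long as the underlying files are readable
```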
I hope it's somewhat clear now.