
How does Spark show the output of a DataFrame even though the table the DataFrame is based on has been deleted?


Consider the following scenario on Azure Databricks.

spark.table("table1").createOrReplaceTempView("test_view") 
df = test_view.drop("col1") 
spark.sql("DROP TABLE table1") 
df1 = df.createOrReplaceTempView("test_view2") 
df1.display()

In the code above, the only action is in the last line. How does the df show the output even though the underlying table is dropped before the action, given that Spark works on the lazy evaluation concept?

I was expecting an error saying the table cannot be resolved, but to my surprise display() shows the output.


Solution

  • After reading your comments I think I see where the misunderstanding is - let's tackle that.

    You are right that Spark follows lazy evaluation - which just means that it will not "materialize" the data until an action is called. Basically, the data comes into memory for processing only after an action is triggered.
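    For example, a minimal sketch of this (the table and column names here are made up for illustration):

    df = spark.table("some_table")   # transformation: builds a query plan, reads no data
    df2 = df.select("some_col")      # transformation: still nothing is read
    df2.show()                       # action: only now does Spark read and process the data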

    Now coming to your code (a corrected, runnable version is sketched after this walkthrough):

    1. spark.table("table1").createOrReplaceTempView("test_view"): This creates (or replaces) a temporary view (not materialized) of the table table1, which you can then query using Spark SQL.
    2. df = test_view.drop("col1"): This statement is wrong as written. test_view is only a view name registered in the catalog, not a Python variable, so this line raises a NameError; a temporary view can only be queried via spark.sql() (or spark.table("test_view")). You can replace the first statement with test_view = spark.table("table1") to make this legal. Now this is a transformation, so the dataset is still not materialized.
    3. spark.sql("DROP TABLE table1"): This drops the table table1 from the Spark catalog.
    4. df1 = df.createOrReplaceTempView("test_view2"): This is a legal statement, but df1 is not a DataFrame. It is a NoneType object, because createOrReplaceTempView() does not return a DataFrame. Replace it with df1 = df, which is somewhat unnecessary, but I will go with it to stay close to your code.
    5. df1.display(): Finally, some action time. Here is what happens now:
      • test_view = spark.table("table1"): test_view gets a copy of the data from table1 in DataFrame form.
      • df = test_view.drop("col1"): col1 is dropped from test_view and a new DataFrame is returned; that copy is now stored in df.
      • spark.sql("DROP TABLE table1"): This drops table1 from the Spark catalog. NOTE: here is the real question - if the table is dropped, why does the data still show? Remember that df holds the copy of test_view after dropping col1, which was itself a copy of table1. That implies the data is still in memory, in both the df and test_view DataFrames.
      • df1 = df -> df1.display(): Similarly, df1 is just another reference to that in-memory copy, so it shows the output without throwing an error: it has no direct dependence on table1, whose contents were already captured in df and test_view. That is why we can still see the data even after table1 is dropped.
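
    Putting it all together, here is a corrected, runnable sketch of your snippet (assuming a table table1 with a column col1 exists; display() is available in Databricks notebooks, elsewhere use show()):

    test_view = spark.table("table1")           # lazy: builds a plan over table1, reads nothing yet
    df = test_view.drop("col1")                 # lazy: another transformation on top of it
    spark.sql("DROP TABLE table1")              # removes table1 from the Spark catalog
    df.createOrReplaceTempView("test_view2")    # registers df under a view name; returns None
    df1 = df                                    # keeps your variable name; just another reference
    df1.display()                               # action: still shows the data despite the drop
    # spark.table("table1") here would fail, since table1 no longer exists in the catalog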

    I hope it's somewhat clear now.