Tags: scala, apache-spark, databricks, databricks-sql

Databricks - All records from Dataframe/Tempview get removed after merge


I am observing a strange issue, and I am not sure whether it stems from a gap in my Spark knowledge.

I have a dataframe as shown in the code below. I create a temp view from it, and I am observing that after the merge operation that temp view becomes empty. I am not sure why.

val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")

 // Result1: the lines below display all the records
 myDf.show()
 spark.table("myView").show()

// performing merge operation
val sql = s"""MERGE INTO employee AS a
        USING  myView AS b
        ON a.Id = b.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *"""

 spark.sql(sql)
 
 // Result2: the ISSUE is here. myDf & myView are both empty
 myDf.show()
 spark.table("myView").show()

Edit

getEmployeeData method performs join between two dataframes and returns the result.

df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")

Solution

  • Dataframes in Spark are lazily evaluated, i.e. not executed until an action like .show or .collect is called, or until they are used in a SQL DML operation such as MERGE. This also means that every time you refer to the dataframe again, it gets re-evaluated.

    Assuming there is no other background activity interfering, your function getEmployeeData apparently depends on the employee table. Its query plan gets executed both before and after the merge, and may therefore yield different results.

    To prevent this you can checkpoint the dataframe. Note that .checkpoint() is called on the dataframe (not the view name), returns a new dataframe, and requires a checkpoint directory to be set:

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
    val myDfStable = myDf.checkpoint()
    myDfStable.createOrReplaceTempView("myView")
    

    or explicitly materialize the dataframe as a table:

    myDf.write.saveAsTable("myViewMaterialized")
    

    and later refer to the materialized version.
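Putting the checkpoint approach together, here is a minimal sketch of the corrected flow. It assumes a Spark session named `spark` (as on Databricks), that `getEmployeeData` reads from the `employee` table, and an illustrative checkpoint directory path:

```scala
// Required once per session before .checkpoint() can be used.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val myDf = getEmployeeData()

// .checkpoint() eagerly materializes the data and returns a new
// DataFrame whose lineage no longer depends on the `employee` table.
val myDfStable = myDf.checkpoint()
myDfStable.createOrReplaceTempView("myView")

spark.sql(s"""MERGE INTO employee AS a
        USING myView AS b
        ON a.Id = b.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *""")

// The view still shows the pre-merge snapshot, because `myView`
// now points at the checkpointed data, not at the lazy join.
spark.table("myView").show()
```

The key difference from the original code is that the temp view is created from the checkpointed dataframe, so re-reading it after the merge does not re-execute the join against the freshly updated `employee` table.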