apache-spark, dataset

Is there any performance hit when calling createOrReplaceTempView on a Spark Dataset?


In my code we use createOrReplaceTempView extensively so that we can run SQL against the generated views. This is done at multiple stages of the transformation, and it also helps us keep the code in modules, each performing a particular operation. A sample is shown below to put my question in context. My questions are:

  1. What is the performance penalty, if any, of creating a temp view from the Dataset?

  2. When I create more than one from each transformation, does this increase memory usage?

  3. What is the life cycle of those views, and is there a function call to remove them?

val dfOne = spark.read.option("header",true).csv("/apps/cortex/landing/auth/cof_auth.csv")
dfOne.createOrReplaceTempView("dfOne")
val dfTwo = spark.sql("select * from dfOne where column_one=1234567890")
dfTwo.createOrReplaceTempView("dfTwo")
val dfThree = spark.sql("select column_two, count(*) as count_two from dfTwo group by column_two")
dfThree.createOrReplaceTempView("dfThree")

Solution

  • No.

    From the manuals, under Running SQL Queries Programmatically:

    The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

    In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact: the DataFrame / Dataset must already exist, and it just needs registering to expose it to the SQL interface.
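
    To make this concrete, here is a minimal sketch (the CSV path and column names are hypothetical) showing that createOrReplaceTempView only registers the Dataset's logical plan under a name in the session catalog; no Spark job runs until an action such as count() or show() is invoked. Temp views live for the lifetime of the SparkSession, and spark.catalog.dropTempView can remove one explicitly.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("temp-view-sketch")
      .master("local[*]")
      .getOrCreate()

    // Reading the CSV is lazy apart from header/schema handling.
    val dfOne = spark.read.option("header", true).csv("/tmp/example.csv")  // hypothetical path

    // Registering the view is cheap: it simply stores the logical plan
    // under the name "dfOne" in the session's catalog.
    dfOne.createOrReplaceTempView("dfOne")

    // Still lazy: spark.sql only builds another logical plan on top of the view.
    val dfTwo = spark.sql("select * from dfOne where column_one = 1234567890")

    // Only an action triggers actual execution.
    println(dfTwo.count())

    // Temp views last as long as the SparkSession, or until dropped explicitly.
    spark.catalog.dropTempView("dfOne")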