apache-spark, azure-synapse

How to create a Temporary Table in Spark-SQL (not a Temp View)


We're migrating code from T-SQL to Spark SQL; it'd be helpful to have Temp Table equivalents.

An earlier question on this is answered with Temporary Views:

%%sql
CREATE OR REPLACE TEMP VIEW MyTmpView AS ...

But these are literally Views. Most of the time a Spark Temp View can take the place of a SQL Temp Table, but not always.

Example:

  1. Create a Temp View, MyTmpView, of all Employees not in DimEmployee
  2. INSERT all rows from MyTmpView INTO DimEmployee
  3. Now use MyTmpView for something else (a second INSERT, a JOIN, etc.). You can't: it's empty, because it's a View whose query, re-run after the INSERT in step 2, logically returns nothing (see the sketch after this list).
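
A minimal PySpark sketch of the failure mode, assuming hypothetical tables Employees and DimEmployee sharing an EmployeeID column, and spark as the active session:

# Temp view of employees not yet in the dimension (step 1)
spark.sql("""
    CREATE OR REPLACE TEMP VIEW MyTmpView AS
    SELECT e.* FROM Employees e
    LEFT ANTI JOIN DimEmployee d ON e.EmployeeID = d.EmployeeID
""")

# Step 2: load the new rows into the dimension
spark.sql("INSERT INTO DimEmployee SELECT * FROM MyTmpView")

# Step 3: the view's query re-runs against the now-updated table,
# so the anti-join matches nothing and the view is empty
spark.sql("SELECT COUNT(*) FROM MyTmpView").show()  # 0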

(Sidenote: Our environment is specifically Azure Synapse Spark, if it matters.)

Edit: I accepted an answer, but in reality Spark SQL has no equivalent to SQL Temp Tables. If one wants, one can drop into PySpark DataFrames, which provide similar functionality. For our use cases it makes more sense to work around the limited situations than to move into DataFrames.
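
One such workaround, sketched below (not from the accepted answer; the snapshot name is illustrative): materialize the view's rows before the INSERT so they survive the view re-evaluating. localCheckpoint(eager=True) materializes the data and truncates the plan's lineage, so later changes to the source tables no longer affect the snapshot:

# Snapshot the view's current rows; lineage is cut, data is materialized
snapshot = spark.sql("SELECT * FROM MyTmpView").localCheckpoint(eager=True)
snapshot.createOrReplaceTempView("MyTmpSnapshot")  # illustrative name

# Step 2 against the snapshot instead of the live view
spark.sql("INSERT INTO DimEmployee SELECT * FROM MyTmpSnapshot")

# MyTmpSnapshot still returns the same rows for a later JOIN or INSERT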


Solution

  • In Spark SQL you can achieve functionality similar to SQL Temporary Tables using DataFrames or temporary views. However, temporary views behave differently in some situations, for example when you perform multiple operations against the same intermediate result.

    Temporary Views:

    • Temporary views in Spark are similar to SQL views, as you mentioned.
    • They are commonly used to represent intermediate data for the duration of a Spark session.
    • They don't hold data; they provide a named query with a structured schema on top of an existing DataFrame or table.
    • Because that query is re-evaluated every time the view is referenced, an INSERT into an underlying table changes what the view returns; in your example, the view becomes empty after step 2.

    DataFrames:

    • DataFrames are a more suitable option if you need to perform multiple operations on the same data, including the equivalent of INSERTs.
    • DataFrames are distributed collections of data that can be manipulated using Spark SQL operations.
    • You can create a DataFrame from your data source and perform various transformations, including filters, joins, and unions (the DataFrame analogue of INSERT).

    Below is how you can perform the operations you mentioned using DataFrames in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql import Row

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Sample dimension table
    dimEmployee = spark.createDataFrame([
        Row(EmployeeID=1, FirstName="John", LastName="Doe"),
        Row(EmployeeID=2, FirstName="Jane", LastName="Smith"),
        Row(EmployeeID=3, FirstName="Bob", LastName="Johnson")
    ])

    # Incoming employee data to be loaded into the dimension
    employee_data = spark.createDataFrame([
        Row(EmployeeID=4, FirstName="Alice", LastName="Brown"),
        Row(EmployeeID=5, FirstName="Eva", LastName="Williams"),
        Row(EmployeeID=6, FirstName="Charlie", LastName="Davis")
    ])

    # Step 1: rows of employee_data not already in dimEmployee
    # (exceptAll compares entire rows and preserves duplicates)
    filtered_data = employee_data.exceptAll(dimEmployee)
    display(filtered_data)

    # Step 2: the DataFrame equivalent of INSERT - append the new rows
    dimEmployee = dimEmployee.union(filtered_data)
    display(dimEmployee)

    # Department data to reuse filtered_data against
    other_data = spark.createDataFrame([
        Row(EmployeeID=4, Department="HR"),
        Row(EmployeeID=5, Department="IT"),
        Row(EmployeeID=7, Department="Sales")
    ])

    # Step 3: filtered_data is still intact, so it can be reused in a join
    joined_data = filtered_data.join(other_data, "EmployeeID", "left")
    display(joined_data)
    

    (Screenshots show the three display() outputs: filtered_data, the unioned dimEmployee, and joined_data.)

    • The code above creates two DataFrames, dimEmployee and employee_data, with sample employee data.

    • exceptAll filters out of employee_data any rows already present in dimEmployee; the result is stored in the filtered_data DataFrame.

    • union appends the rows of filtered_data to dimEmployee, which is the DataFrame equivalent of the INSERT in your step 2.

    • other_data is a new DataFrame with department information for some employees.

    • Finally, filtered_data is reused in a left join with other_data on the "EmployeeID" column, and the result is stored in joined_data. Unlike the temp view, filtered_data is unaffected by the earlier union, so it can safely be reused here.
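
    If the downstream steps are written in Spark SQL rather than PySpark, the materialized result can be handed back to SQL as a temporary view; a small sketch (the view name is hypothetical):

    # Expose the joined DataFrame to Spark SQL under a hypothetical name
    joined_data.createOrReplaceTempView("JoinedEmployees")
    spark.sql("SELECT * FROM JoinedEmployees WHERE Department IS NOT NULL").show()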