We're migrating code from T-SQL to Spark SQL; it'd be helpful to have Temp Table equivalents.
An earlier question on this, is answered with Temporary Views.
%%SQL
CREATE OR ALTER TEMP View MyTmpView AS...
But these are literally Views. Most of the time a Spark Temp View can take the place of a SQL Temp Table, but not always.
Example:
(Sidenote: Our environment is specifically Azure Synapse Spark, if it matters.)
Edit: I accepted an answer - but in reality, in Spark SQL, there are no equivalents to SQL Temp Tables. If one wants, one can go into PySpark dataframes which can provide similar functionality. For our use cases, it would make more sense to work around the limited situations, rather than move into dataframes.
In Spark SQL, you can achieve similar functionality to SQL Temporary Tables using DataFrames or by creating temporary views. However, there are some differences to consider when working with temporary views in Spark, for example when you want to perform multiple operations.
Temporary Views:
DataFrames:
The below is the approach how you can perform the operations you mentioned using DataFrames in Spark SQL:
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.appName("example").getOrCreate()
dimEmployee = spark.createDataFrame([
Row(EmployeeID=1, FirstName="John", LastName="Doe"),
Row(EmployeeID=2, FirstName="Jane", LastName="Smith"),
Row(EmployeeID=3, FirstName="Bob", LastName="Johnson")
])
employee_data = spark.createDataFrame([
Row(EmployeeID=4, FirstName="Alice", LastName="Brown"),
Row(EmployeeID=5, FirstName="Eva", LastName="Williams"),
Row(EmployeeID=6, FirstName="Charlie", LastName="Davis")
])
filtered_data = employee_data.exceptAll(dimEmployee)
display(filtered_data)
dimEmployee = dimEmployee.union(filtered_data)
display(dimEmployee)
other_data = spark.createDataFrame([
Row(EmployeeID=4, Department="HR"),
Row(EmployeeID=5, Department="IT"),
Row(EmployeeID=7, Department="Sales")
])
joined_data = filtered_data.join(other_data, "EmployeeID", "left")
display(joined_data)
In the above code Created two DataFrames dimEmployee and employee_data with sample employee data.
Filtering the data in employee_data that is not in dimEmployee Using the exceptAll
method to filter out rows from employee_data that are present in dimEmployee. The filtered result is stored in the filtered_data DataFrame
union the filtered_data DataFrame with the dimEmployee DataFrame,adding the rows from filtered_data to dimEmployee.
A new DataFrame other_data with some department information for employees.
Join filtered_data with other_data based on the "EmployeeID" column using a left join.
The code performs a left join between filtered_data and other_data on the "EmployeeID" column and stores the result in the joined_data DataFrame.