"With as" SQL clause, also referred as CTE, is known as a handy way to improve readability of the queries. As far as I understand in classic SQL CTE results are NOT cached, and each CTE is just replaced by relevant SQL, and not by resultset of relevant SQL.
So given I have a CTE in my Spark SQL:
with myCTE as (select * from someTable)
The resultSet from my CTE wont be cached and each time I will select from it - the table will be read by Spark again. Correct?
You can control will CTE be materialized in Hive or not and cache it in Spark.
In hive the result of CTE can be materialized before main query execution, by default this feature is disabled, use this configuration parameter:
set hive.optimize.cte.materialize.threshold=1;
If the number of references to a CTE clause exceeds this threshold, Hive will materialize it before executing the main query block. -1
will disable this feature.
In Spark SQL you can cache table and use it multiple times in other queries.