Tags: sql, apache-spark, databricks, databricks-sql

Databricks SQL - CTE namespace (bug?) with temporary views


Bugs and sharp corners are part of life, so I'm not overly concerned that this is happening. I'm more concerned with how I can figure out why, because it means there's something about how Spark parses queries that I fundamentally don't understand. Moreover, I can't think of any reason why you'd deliberately want the behavior that forces this error. But it's easy to trigger accidentally (define an unused CTE), and it leaves you with an error that is a nightmare to figure out.

Is there anything beyond "explain yada_yada" that I can do to look further under the hood of failed SQL commands?
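For reference, Spark SQL's EXPLAIN statement accepts several modes beyond the default; EXTENDED in particular prints the parsed, analyzed, and optimized logical plans, which is usually where a resolution failure like this one surfaces. A sketch using standard Spark 3.x syntax (note that when analysis itself fails, EXPLAIN fails with it, so it can help to run these against the view's body before wrapping it in CREATE VIEW):

```sql
-- Parsed, analyzed, and optimized logical plans plus the physical plan
explain extended select * from global_temp.gtv_3;

-- Cleaner, sectioned physical-plan output (Spark 3.0+)
explain formatted select * from global_temp.gtv_3;

-- Includes node statistics where available
explain cost select * from global_temp.gtv_3;
```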

Screenshots are provided just for proof that I'm not losing my mind; the pertinent code is included as plain text as well.

Runtime: DBR 13.3 LTS

Spark Version: 3.4.1

Setup and Summary: N temporary views, each of which runs successfully in isolation, plus a final dependent temporary view. The same behavior is observed using global or local temp views.

When the primary temporary views are referenced in CTEs in the final dependent view, every CTE must be in the final select's namespace (i.e. joined to); otherwise the primary view falls completely out of scope.

create or replace global temporary view gtv_1 as 
   select 1 as dummy_1;

create or replace global temporary view gtv_2 as 
   select 1 as dummy_2;

This works (note the superfluous join to cte2, even though cte2 never appears in the select list):

create or replace global temporary view gtv_3 as
    with cte1 as (
      select dummy_1
        from global_temp.gtv_1
    ),
    cte2 as (
      select dummy_2 
        from global_temp.gtv_2
    )
    select cte1.*
      from cte1
     inner 
      join cte2
        on cte1.dummy_1 = cte2.dummy_2;
select * 
  from global_temp.gtv_3;

dummy_1
1

This does not work (note the absence of the superfluous join):

create or replace global temporary view gtv_3 as
    with cte1 as (
      select dummy_1
        from global_temp.gtv_1
    ),
    cte2 as (
      select dummy_2 
        from global_temp.gtv_2
    )
    select cte1.*
      from cte1;
select * 
  from global_temp.gtv_3;

Fails with error: [TABLE_OR_VIEW_NOT_FOUND] The table or view global_temp.gtv_2 cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 7 pos 13

explain select * from global_temp.gtv_3;
Issue Description:
Error Type: Query Planning Error
Error Message: [TABLE_OR_VIEW_NOT_FOUND] The table or view global_temp.gtv_2 cannot be found. Verify the spelling and correctness of the schema and catalog. If not qualified, check the current_schema() output or qualify the name with the correct schema and catalog. To tolerate the error on drop, use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.
Error Location: Line 7, Position 13
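For what it's worth, one workaround that avoids touching any configuration, assuming the unused CTE really is dead code, is to drop it from the view definition entirely:

```sql
-- Same view, minus the never-referenced cte2: nothing left to mis-resolve
create or replace global temporary view gtv_3 as
    with cte1 as (
      select dummy_1
        from global_temp.gtv_1
    )
    select cte1.*
      from cte1;

select * 
  from global_temp.gtv_3;
```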


EDIT:

Traced it to this configuration option:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3872

I won't pretend that I'm qualified to explain exactly why this matters. It looks like behavior introduced while solving something else with nested CTE naming. But just from reading the code, I'm satisfied that CTE references are complex enough that oddities like this are understandable.

So this works, with LEGACY and only LEGACY.

%sql

set spark.sql.legacy.ctePrecedencePolicy = LEGACY; -- LEGACY works. Switching to CORRECTED or EXCEPTION does not.

create or replace temporary view gtv_1 as 
   select 1 as dummy_1;

create or replace temporary view gtv_2 as 
   select 1 as dummy_2;

create or replace temporary view gtv_3 as
    with cte1 as (
      select dummy_1
        from gtv_1
    ),
    cte2 as (
      select dummy_2 
        from gtv_2
    )
    select cte1.*
      from cte1;
select * 
  from gtv_3
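Since this is a session-scoped setting, it may be worth confirming what the session is actually using and restoring the default once the views exist; SET with no value and RESET are standard Spark SQL:

```sql
-- Show this session's current value for the policy
set spark.sql.legacy.ctePrecedencePolicy;

-- Restore the default (EXCEPTION in recent Spark releases) afterwards
reset spark.sql.legacy.ctePrecedencePolicy;
```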

Solution

  • Moving the edit above to an answer.

    This configuration option influences the name scope of CTEs:

    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3872

    I can't explain exactly how that code works, but keeping track of CTE references looks complex enough that edge cases like this are forgivable.

    This works with LEGACY and only LEGACY. The statements are identical to the edit above: set spark.sql.legacy.ctePrecedencePolicy = LEGACY; then recreate the three views. Switching the policy to CORRECTED or EXCEPTION still fails.