sql join view snowflake-cloud-data-platform where-clause

Snowflake uses where-clause in wrong place when joining

I have views that I need to join, and these have filters to exclude records with invalid data. In the views the filtering is done before the remaing records are cast to their respecive data type, but when joining the views it is not. The join fails on invalid records even though the views standalone do not deliver them.

Here is a very simplified example:

create or replace table SRC (DTM varchar);
insert into SRC values ('2020-01-01'),('00-00-00');

create or replace view SRC_V as 
(
    select DTM::timestamp as DTM 
    from SRC where DTM!='00-00-00'
);

select * from
SRC_V as a 
inner join SRC_V as b
ON a.DTM = b.DTM;

This gives Timestamp '00-00-00' is not recognized so even though the view filters this record out, it is still present in the join.

Unfortunately I can not manipulate the views, and the join code is generated by a code generator so there is not a lot of options to make special code for just this scenario, so I need to understand if this is a bug or by design before we make any major changes to accommodate for this.

I have tried this case in SQL Server where it works as I expect.

Also, there are variants that do work, such as:

with x as (select DTM from SRC_V)
select * from
x as a 
inner join x as b
ON a.DTM = b.DTM;

If the behaviour is by design, I feel that this example should also fail. It also works with full outer but not with left or right outer which also feels strange.

Solution

This is by design. Because you're performing a self-join (join on the same table), the optimizer knows that it can apply the where clause predicate on both scans. Notice the plan has the filter for !='00-00-00' twice, once in Filter2 and once in Filter4:

Applying a where clause predicate on a scan can reduce (prune) the number of micropartitions Snowflake has to scan. Pruning by where clause happens before execution starts and is usually more effective at pruning than the next operator up the plan, which is JoinFilter [5].

As far as why this does not happen with a CTE, it's a completely different plan. By the time the join happens the rows are already scanned and passed through the filter:

However, using a CTE is not a guarantee that it will always work this way. The Snowflake optimizer can and will change plans based on the warehouse size and data volumes. It's possible that while this plan works at small scale, running it with much higher row counts could form a plan that encounters the same filtering problem as the view.

You mentioned that something is generating the code and that you may not have access to it, but the safest way to handle this situation is to change how the view is casting the varchar to a timestamp. Use of the :: notation for casting does not handle errors. You can have someone who can change the code and/or views to use something that will handle errors, something like this:

create or replace view SRC_V as 
(
    select try_to_timestamp(DTM) as DTM 
    from SRC where DTM!='00-00-00'
);

This will convert the 00-00-00 values to database null, and because null does not equal anything including database null, it will not join these rows.

Edit: I would also be careful about aliasing computed columns to the same names as an existing column in the table, particularly when joining on that column. While it works in this simple example, it could present problems. Here's a KB article discussing it:

https://community.snowflake.com/s/article/Beware-with-Column-Aliases-in-Joins

It will be safer to change the alias to a different name than the original column and use that alias in the join condition.