After particular time on solving task and reviewing Snowflake documentation I noticed potential improvement in query for readability and possibly performance improvement. My query is using nested correlated query to check it there is any existing updates for main table using separate table with changes. Both tables don't have explicit PK or any other constraints on allowed values. Here is an example of simplified query:
SELECT a.*
FROM tableA a
WHERE EXISTS (
SELECT 1
FROM tableA_CDC a_cdc
WHERE a.column1 = a_cdc.column1
AND a.column2 = a_cdc.column2
AND (a.column3 = a_cdc.column3 OR (a.column3 IS NULL AND a_cdc.column3 IS NULL))
)
I was interested in the last predicate (a.column3 = a_cdc.column3 OR (a.column3 IS NULL AND a_cdc.column3 IS NULL)). For column3 value can be null, so we want to fetch rows from main table. Column1 and column2 cannot have null values and we can ignore null handling.
The problem I found was not only in readability, but as I noticed in performance. Basically if we compare only by '=' or checking if both columns are NULLs - everything works fine (using query profile). The sum of counts of data by each predicate gives correct result. But if we have grouped condition on equals or is null, then we have correct changes count, but query profile shows that full table scan was performed.
In documentation I found function called 'EQUAL_NULL', that allow null-safely compare two expressions. If I modify query by replacing last grouped predicate with EQUAL_NULL, then result is correct and there is no full table scan.
SELECT a.*
FROM tableA a
WHERE EXISTS (
SELECT 1
FROM tableA_CDC a_cdc
WHERE a.column1 = a_cdc.column1
AND a.column2 = a_cdc.column2
AND EQUAL_NULL(a.column3, a_cdc.column3)
)
Any ideas why we have full table scan in first case?
I had this problem too. It's hard to know exactly what's going on with your query without some information from the query profile. However, when I experienced this problem, the OR in my WHERE clause was acting almost like a cartesian join. Because I was comparing the columns from two different tables to each other (similar to a join), the query appeared to be comparing the value from one table to all the values in the other (similar to the issues experienced in a query like the one in this post). I solved this by unioning the tables together like this:
SELECT 1
FROM tableA_CDC a_cdc
WHERE a.column1 = a_cdc.column1
AND a.column2 = a_cdc.column2
AND a.column3 = a_cdc.column3
UNION ALL
SELECT 1
FROM tableA_CDC a_cdc
WHERE a.column1 = a_cdc.column1
AND a.column2 = a_cdc.column2
AND a.column3 IS NULL AND a_cdc.column3 IS NULL
This drastically improved my own query, and it mirrors the answer from the link I've shared above (plus a lot of others out there).
Also, just thinking about the function of this table, I always think of streams when I think of checking for table updates. Not sure how familiar you are with them, but depending on what you're going to do with these updates, I highly recommend streams and tasks.