I have a table in Snowflake. One of the columns in the table is called obj_key (object key).
The table is very large (in TBs), so performance matters.
A new row is added to the table every time an object is updated. The newly inserted row has the same obj_key, but a different value in the time_modified column. Suppose I want to fetch the distinct obj_key values from the table, subject to certain conditions.
I have three approaches in mind:
Approach 1:
SELECT obj_key
FROM my_table
WHERE some_condition
GROUP BY obj_key;
Approach 2:
SELECT DISTINCT obj_key
FROM my_table
WHERE some_condition;
Approach 3:
SELECT obj_key
FROM my_table
WHERE some_condition
QUALIFY ROW_NUMBER() OVER (PARTITION BY obj_key ORDER BY obj_key) = 1;
So essentially my question boils down to these:
I had read that distinct on multiple columns is performed by group_by(col1, col2, ..., col n). So how does the performance differ on the two (if it does)?
Since PARTITION BY also requires an ORDER BY, does it not drastically reduce the performance?
I'd love it if someone could explain the nitty-gritty of how these queries run on Snowflake.
The first two queries can be executed with the same execution plan, depending on Snowflake's cardinality estimates.
Your third approach will use a window function operator, and it would probably take more time.
Since you have the dataset, I would HIGHLY recommend running your own tests and examining the execution plans and the performance:
https://docs.snowflake.com/en/user-guide/ui-query-profile.html#how-to-access-query-profile
Actually, I did some tests with the SNOWFLAKE_SAMPLE_DATA database, and I can see that the first two queries are executed with the same execution plan and perform better than the third query.
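If you want to repeat that kind of comparison yourself, something like the following could be a starting point. It is only a sketch: the table and columns (TPCH_SF1.ORDERS, o_custkey, o_orderstatus) are my assumption, chosen as stand-ins for my_table, obj_key and some_condition, and they are not necessarily the ones used in the test above. EXPLAIN prints the plan without running the query, so you can compare the operators of the three approaches side by side.

```sql
-- Assumption: ORDERS / o_custkey / o_orderstatus from the TPCH_SF1 sample schema
-- stand in for my_table / obj_key / some_condition.

-- Turn off the result cache so that later timing runs are not served from cache.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Approach 1: GROUP BY
EXPLAIN USING TEXT
SELECT o_custkey
FROM snowflake_sample_data.tpch_sf1.orders
WHERE o_orderstatus = 'F'
GROUP BY o_custkey;

-- Approach 2: DISTINCT
EXPLAIN USING TEXT
SELECT DISTINCT o_custkey
FROM snowflake_sample_data.tpch_sf1.orders
WHERE o_orderstatus = 'F';

-- Approach 3: window function with QUALIFY
EXPLAIN USING TEXT
SELECT o_custkey
FROM snowflake_sample_data.tpch_sf1.orders
WHERE o_orderstatus = 'F'
QUALIFY ROW_NUMBER() OVER (PARTITION BY o_custkey ORDER BY o_custkey) = 1;
```

After comparing the plans, run the same three statements without the EXPLAIN prefix and look at each one in the Query Profile to compare elapsed time and the operators actually used. Results on your own table may differ, since they depend on data volume, clustering and the selectivity of your filter.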