Snowflake's performance on GROUP BY vs PARTITION BY vs DISTINCT


I have a table in Snowflake. One of its columns is called obj_key (object key). The table is very large (terabytes), so performance matters.

A new row is added to the table every time an object is updated. The newly inserted row has the same obj_key but a different value in the time_modified column. Suppose I want to fetch the distinct obj_key values from the table, subject to certain conditions.

I have three approaches in mind:

Approach 1:

SELECT obj_key 
FROM my_table
WHERE some_condition
GROUP BY obj_key;

Approach 2:

SELECT DISTINCT obj_key
FROM my_table
WHERE some_condition;

Approach 3:

SELECT obj_key
FROM my_table
WHERE some_condition
QUALIFY ROW_NUMBER() OVER (PARTITION BY obj_key ORDER BY obj_key) = 1;

My questions boil down to these:

I have read that DISTINCT on multiple columns is executed as a GROUP BY on (col1, col2, ..., col n). How does the performance of the two differ, if at all?

Since ROW_NUMBER() with PARTITION BY also requires an ORDER BY, doesn't that drastically reduce performance?

I'd love it if someone could explain the nitty-gritty of how these queries run on Snowflake.


Solution

  • The first two queries can be executed with the same execution plan, depending on Snowflake's cardinality estimates.

    Your third approach uses a window function operator and will probably take longer.

    Since you have the dataset, I would HIGHLY recommend running your own tests and examining the execution plans and the performance:

    https://docs.snowflake.com/en/user-guide/ui-query-profile.html#how-to-access-query-profile

    In fact, I ran some tests against the SNOWFLAKE_SAMPLE_DATA database and saw that the first two queries are executed with the same execution plan and perform better than the third query.
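
A minimal sketch of how such a comparison could be set up, not necessarily the exact test described above. It assumes that SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS stands in for my_table, O_CUSTKEY for obj_key, and a date filter for some_condition; all three are illustrative choices, since the table used in the tests is not stated. EXPLAIN prints the compiled plan without running the query, so the three plans can be compared side by side.

```sql
-- Assumptions (illustrative only): ORDERS stands in for my_table,
-- O_CUSTKEY for obj_key, and the O_ORDERDATE filter for some_condition.

-- Turn off the result cache so that repeated timing runs are really executed.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Approach 1: GROUP BY
EXPLAIN USING TEXT
SELECT O_CUSTKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
WHERE O_ORDERDATE >= '1998-01-01'
GROUP BY O_CUSTKEY;

-- Approach 2: DISTINCT
EXPLAIN USING TEXT
SELECT DISTINCT O_CUSTKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
WHERE O_ORDERDATE >= '1998-01-01';

-- Approach 3: window function filtered with QUALIFY
EXPLAIN USING TEXT
SELECT O_CUSTKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
WHERE O_ORDERDATE >= '1998-01-01'
QUALIFY ROW_NUMBER() OVER (PARTITION BY O_CUSTKEY ORDER BY O_CUSTKEY) = 1;
```

If the plans for approaches 1 and 2 list the same operators while approach 3 adds a window function step, that matches the behavior described above. EXPLAIN shows only the plan, so to compare actual run times, execute the three statements without EXPLAIN and inspect each one in Query Profile.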