Search code examples
clustered-indexsnowflake-cloud-data-platform

Snowflake: 2 correlated columns have very different clustering information (one has perfect, the other has terrible)


We have a table with 120M rows (over 2222 micropartitions), that has 2 important columns, record_id with values in format as prefix|<account_id>|<uuid> (unique) and column account_id, which has the value of <account_id>. Note that the prefix is same for all records. Then of course some factum columns, but that is not relevant.

Snowflake shows perfect clustering for the record_id column (automatically chosen by SF, no specified clustering is set by us) via clustering_information function:

"total_partition_count" : 2222,
 "total_constant_partition_count" : 2222,
 "average_overlaps" : 24.0,
 "average_depth" : 25.0,

However, for the column account_id, the clustering is very bad

 "total_constant_partition_count" : 0,
 "average_overlaps" : 2221.0,
 "average_depth" : 2222.0,

There is about 130 distinct account ids, which means that on average, records of one account_id should be over 17 partitions. Even if snowflake clusters by records_id, the beginning of that column (prefix|<account_id>) correlates with account_id column. So, records with the same account_id should end up in the same partitions. Therefore, I cannot figure out why there is 100% overlap of micropartitions for the account_id column. It is like if snowflake used some weird sorting for the record_id column and thus scattered rows of each account across all partitions. Is that possible?

This has negative consequences on performance, since doing a query with account_id filter results in scan of all partitions.

Note: also asked this question at snowflake forum https://support.snowflake.net/s/question/0D50Z00008vfglCSAQ/2-correlated-columns-have-very-different-clustering-information-one-has-perfect-the-other-has-terrible


Solution

  • In Snowflake's clustering reporting functions like those posted above, there is a limitation that only the first 6 characters of a varchar are considered for assessing clustering depth. So I would not trust the great results reported for record_id since the first 6 characters may be identical due to prefix even if the subsequent account_id's are random.

    The best solution would be to explicitly declare clustering on account_id and activate auto-clustering on the table.