We have a table with 120M rows (over 2222 micropartitions), that has 2 important columns, record_id with values in format as prefix|<account_id>|<uuid>
(unique) and column account_id, which has the value of <account_id>
. Note that the prefix is same for all records. Then of course some factum columns, but that is not relevant.
Snowflake shows perfect clustering for the record_id column (automatically chosen by SF, no specified clustering is set by us) via clustering_information function:
"total_partition_count" : 2222,
"total_constant_partition_count" : 2222,
"average_overlaps" : 24.0,
"average_depth" : 25.0,
However, for the column account_id, the clustering is very bad
"total_constant_partition_count" : 0,
"average_overlaps" : 2221.0,
"average_depth" : 2222.0,
There is about 130 distinct account ids, which means that on average, records of one account_id should be over 17 partitions. Even if snowflake clusters by records_id, the beginning of that column (prefix|<account_id>
) correlates with account_id column. So, records with the same account_id should end up in the same partitions. Therefore, I cannot figure out why there is 100% overlap of micropartitions for the account_id column. It is like if snowflake used some weird sorting for the record_id column and thus scattered rows of each account across all partitions. Is that possible?
This has negative consequences on performance, since doing a query with account_id filter results in scan of all partitions.
Note: also asked this question at snowflake forum https://support.snowflake.net/s/question/0D50Z00008vfglCSAQ/2-correlated-columns-have-very-different-clustering-information-one-has-perfect-the-other-has-terrible
In Snowflake's clustering reporting functions like those posted above, there is a limitation that only the first 6 characters of a varchar are considered for assessing clustering depth. So I would not trust the great results reported for record_id since the first 6 characters may be identical due to prefix even if the subsequent account_id's are random.
The best solution would be to explicitly declare clustering on account_id and activate auto-clustering on the table.