Search code examples
indexingsnowflake-cloud-data-platformcolumnstoreclustering-key

Should most Snowflake tables have a cluster key defined?


I'm new to Snowflake relatively. I get that it's a columnar based database. That's not exactly magical - it's just better at more Analytics / BI workloads as opposed to more writing transactions.

So it doesn't "have indexes" - but it has clustering, micro-partitions, and search optimizations -- which seem very similar to indexes in a way.

It also seems to automatically collect stats - at least about what data ranges it has written to various partitions. Not sure if it collects 'common query' stats and uses that to optimize under the hood.

I mean call it an index, call it a "cluster" --- call it a "map" -- clearly it has indexes. A database without indexes is like a Library that is just a pile of 1 million books in a mountain. Unusuable.

So anyway -- is it really "smart" enough to auto-define under-the-hood partitioning and indexing --- or is it usually a good idea to provide a cluster key to it by predicting the workload?

For instance, let's take one of the most common use cases for Analytics. You have 5 years worth of data --- but a user often only wants to look at 1-2 continuous months at a time. Seems logical it would be best to partition/ cluster the data based on event_date or whatever (despite high cardinality). -- But I must admit, I haven't fully wrapped my head around columnar databases. ... Even if a column is say "order value" -- it would still be best to be organized by date to speed up retrieval?


Solution

  • As you say, the "best" is usually to have the data sorted in a way that most of your queries will prune unneeded data. This is usually seen as "clustering by timestamp".

    But: You don't need to ask Snowflake to cluster by timestamp if you are adding data every day. Then the data is naturally stored by timestamp, so there's no need to add a secondary process "re-sorting" data when needed.

    What about ingesting archived data into Snowflake? Instead of asking Snowflake to cluster it, you can ingest it sorted - and thus it will be ready for your queries based on timestamp to prune it.

    As in:

    create table `my_table`
    as (
      select *
      [...]
      order by timestamp
    )
    

    So when would you use clustering, instead of the natural sorting order: