caching snowflake-cloud-data-platform shared-nothing

Snowflake - Caching with Data Freshness

I've read a lot about snowflake caching, but do not understand how snowflake gets around data freshness problem encountered by other "shared everything architectures". My understanding is that, with shared everything when the data changes, there essentially needs to be some locking.

As explained in this article:

For efficiency both nodes have cached local copies of record 1 in memory. A client then tries to update record 1 so that ‘foo’ becomes ‘bar’. To do this in a consistent manner the DBMS must take a distributed lock on all nodes that may have cached record 1. Such distributed locks become slower and slower as you increase the number of machines in the cluster and as a result can impede the scalability of the writing process.

In other words, if snowflake is caching data in the compute layer, but the underlying data changes in S3, doesn't snowflake run into the same caching issues of other shared everything architectures? I get that in an analytics database, there are fewer updates, but then why wouldn't traditional shared-everything architectures work?

The only answer I've seen is that snowflake has a query optimizer that checks to see if the underlying data has changed. But still don't understand this/how this is a significant upgrade over caching strategy by traditional shared-everything DBs.

Solution

I believe the key concept that you are missing here is that Snowflake never updates a record...ever. It instead creates a new Micro-Partition that contains all the data from the old Micro-Partition with the changes applied. In this way, the services layer of Snowflake immediately knows that a Micro-Partition in a warehouse cache is no longer valid, because it's no longer active according to the services layer.

The only time you'd need a lock is if you were "updating" the cache or the underlying data. That never happens here. The new micro-partitions become active as soon as the update statement is complete. While the update is running, there are no "dirty" states of the data, so any query that runs against that table will always either be the previous state while the update is running or the post-update state. No locks needed.

Does that answer your question?