
Separation of Storage & Compute in Snowflake


One of the key features in Snowflake is the separation of storage and compute.

While I understand what it means, I realized that I don't really grasp what is so special about it. More specifically, I don't see what it means not to have it, at least in the context of cloud databases.

For instance, I read that Redshift does not provide it. What does that imply? What can software like Redshift not do (or do less well) without it that Snowflake can?

Note: I do not mean to discuss the merits of various solutions, only their objective differences with regard to this one specific feature.


Solution

  • The main benefit of Snowflake's decoupling of storage and compute is scalability: any number of users can access the same data, and compute does not have to sit underutilized during off-peak hours. As the quoted source explains:

    Cloud infrastructure uniquely enables full elasticity because resources can be added and discarded at any time. That makes it possible to have exactly the resources you need for all users and workloads, but only with an architecture designed to take full advantage of the cloud. Snowflake’s separation of storage, compute, and system services makes it possible to dynamically modify the configuration of the system. Resources can be sized and scaled independently and transparently, on-the-fly. This makes it possible for Snowflake to deliver full elasticity across multiple dimensions:

    Data: The amount of data stored can be increased or decreased at any time. Unlike shared-nothing architectures where the ratio of storage to compute is fixed, the compute configuration is determined independently of the volume of data in the system. This architecture also makes it possible to store data at a very low cost because no compute resources are required to store data in the database.

    Compute: The compute resources being used for query processing can also be scaled up or down at any time as the intensity of the workload on the system changes. Because storage and compute are decoupled, and the data is dynamically distributed, changing compute resources does not require reshuffling the data. Compute resources can be changed on-the-fly, without disruption.
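
To make the two dimensions in the quote concrete, here is a toy Python model. It is only a sketch under stated assumptions: the coupled system is assumed to place rows with naive hash-modulo partitioning, and every name and number (`owner`, `tb_per_node`, the node counts) is hypothetical. It does not describe how Redshift or Snowflake actually place data; it only shows why tying data to nodes makes resizing costly and fixes the ratio of storage to compute.

```python
import hashlib
import math


def owner(key: str, n_nodes: int) -> int:
    """Assign a row key to a node with a stable hash modulo the node count.

    Assumption: naive hash-modulo placement, chosen only for illustration.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


def rows_moved_coupled(keys, old_nodes: int, new_nodes: int) -> int:
    """Coupled model: each node stores its own rows, so a resize must
    relocate every row whose owning node changes."""
    return sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))


def rows_moved_decoupled(keys, old_nodes: int, new_nodes: int) -> int:
    """Decoupled model: rows live in shared storage and compute nodes hold
    no permanent data, so changing the node count relocates nothing."""
    return 0


def nodes_needed_coupled(data_tb: float, tb_per_node: float, query_nodes: int) -> int:
    """Coupled model: the cluster must be large enough to hold the data,
    even if the query workload alone would need fewer nodes."""
    return max(math.ceil(data_tb / tb_per_node), query_nodes)


def nodes_needed_decoupled(data_tb: float, query_nodes: int) -> int:
    """Decoupled model: compute is sized by the workload only; the data
    volume does not enter the calculation."""
    return query_nodes


if __name__ == "__main__":
    keys = [f"row-{i}" for i in range(10_000)]
    print("rows moved, coupled 4 -> 8 nodes:  ", rows_moved_coupled(keys, 4, 8))
    print("rows moved, decoupled 4 -> 8 nodes:", rows_moved_decoupled(keys, 4, 8))
    # Hypothetical sizing: 100 TB of data, 2 TB per node, workload needs 4 nodes.
    print("nodes needed, coupled:  ", nodes_needed_coupled(100, 2, 4))
    print("nodes needed, decoupled:", nodes_needed_decoupled(100, 4))
```

In this model, doubling the coupled cluster relocates roughly half of the rows, while the decoupled one relocates none. Likewise, a large but rarely queried dataset forces the coupled cluster to keep many nodes running just to hold the data, whereas the decoupled one pays for compute only in proportion to the workload.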