In Redshift, Snowflake, and Azure SQL DW, are storage and compute decoupled?
When compute and storage were tightly coupled and we wanted to scale, we scaled both. But under the hood, was it a virtual machine, and did we scale the compute and the VM's disks? Do you have any recommended reading on this?
Massive thanks, I am confused now and it would be a blessing if someone could jump in to explain!
You have reason to be confused as there is a heavy layer of marketing being applied in a lot of places. Let's start with some facts:
All databases need local disk to operate. This disk can store permanent versions of the tables (classic locally stored tables) and is needed to hold the local working set of data for the database to operate. Even in cases where no tables are permanently stored on local disk, the size of the local disks is critical, as this determines how much data fetched from remote storage can be worked on and cached.
Remote storage of permanent tables comes in two "flavors": defined external tables and transparent remote tables. While there are lots of differences in how these flavors work and how each database optimizes them, they all store the permanent version of the table on disks that are remote from the database compute system(s).
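To make the two flavors concrete, here's a rough SQL sketch of each. The first is Redshift Spectrum's defined-external-table DDL (it assumes an external schema has already been set up); the second is a plain Snowflake table, which the engine transparently stores on cloud object storage. All names and the S3 path are made up:

```sql
-- Flavor 1: defined external table (Redshift Spectrum).
-- The data lives in S3 and the table declares that location explicitly.
-- Assumes the external schema spectrum_schema already exists.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- Flavor 2: transparent remote table (Snowflake).
-- The DDL looks identical to a local table; the engine keeps the data
-- on remote object storage behind the scenes.
CREATE TABLE sales (
    sale_id   INT,
    sale_date DATE,
    amount    NUMBER(10,2)
);
```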
Remote permanent storage comes with pros and cons. "Decoupling" is the most often cited advantage for remote permanent storage. It just means that you cannot fill up the local disks with "cold" data, as only "in use" data is stored on the local disks in this case. To be clear, you can still fill up (or brown out) the local disks even with remote permanent storage if the working set of data is too large. The downside of remote permanent storage is that the data is remote. Being across a network from some flexible storage solution means that getting to the data takes more time (each database system has its own methods for hiding this in as many cases as possible). It also means that coherency control for the data happens (in some respects) across the network, which comes with its own impacts.
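The flip side of decoupling is that compute can be scaled without moving any data. For instance, in Snowflake (one of the systems asked about), resizing a virtual warehouse is a single statement; the warehouse name here is hypothetical:

```sql
-- Scale compute up independently of storage; no table data is redistributed.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';
```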
External tables and transparent remote tables are both permanently stored remotely, but there are differences. An external table isn't under the same coherency structure that a fully-owned table is under (whether local or remote). "Transparent remote" just means that the database works with the remote table "as if" it were locally owned.
VMs don't change the local disk situation. An amount of local disk is apportioned to each VM on the box. The disks are still local; it's just that only a portion of the physical disks is addressable by any one VM.
So, leaving fact and moving to opinion: while marketing will tell you that one type of database storage is better than the other in all cases, this just isn't true. Each has advantages and disadvantages, and which is best for you depends on what your needs are. The database providers that offer only one data organization will tell you that it is the best option, and for some it is.
Local table storage will always be faster for those applications where speed of access to data is critical and caching doesn't work. However, this means that DBAs need to do the work of keeping the on-disk data optimized and fitting in the available local storage (for the compute size needed). This is real work and takes time and energy. What you gain in moving remote is a reduction in this work, but it comes at the cost of some combination of database cost, hardware cost, and/or performance. Sometimes that tradeoff is worth it, sometimes not.
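For a sense of what that DBA work looks like, these are the sort of routine maintenance commands run against locally stored tables in Redshift (the table name is made up):

```sql
-- Re-sort rows and reclaim space left by deleted/updated rows.
VACUUM sales;

-- Refresh the statistics the query planner relies on.
ANALYZE sales;
```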