query-optimization, snowflake-cloud-data-platform

Snowflake Warehouse Selection


We have had our ETL jobs running in Snowflake for some time. We have two warehouses: a Medium and a 2XL. One of the rules we follow is that if a query runs for less than 10 minutes we put it on the Medium warehouse; anything longer than that goes to the 2XL. But is this the right way to segregate queries?
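
For context, this is roughly how such a threshold can be checked from the account usage query history (just a sketch; the 7-day window is arbitrary):

-- Sketch: recent queries that ran longer than 10 minutes
-- (TOTAL_ELAPSED_TIME in ACCOUNT_USAGE.QUERY_HISTORY is in milliseconds).
select query_id,
       warehouse_name,
       warehouse_size,
       total_elapsed_time / 60000 as elapsed_minutes
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
  and total_elapsed_time > 10 * 60 * 1000
order by total_elapsed_time desc;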

I know Snowflake parallelizes queries based on core availability.

For example, a 2XL cluster has 32 nodes with 8 cores each, and Snowflake tries to split every query into 256 parts. So if we run the query:

select sum(a) from mytable;

Then each of the 256 cores scans 1/256 of the table.

But how do I know whether a table can even be split across 256 cores? It may not have enough data to split that far, and in that case it does not make sense to run the query on the 2XL.
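
(For what it's worth, I can see a table's rough size with something like the query below, but I am not sure how that translates into how many ways Snowflake will actually split the scan:)

-- Rough size check: row count and compressed bytes for the table in question.
select table_name, row_count, bytes
from information_schema.tables
where table_name = 'MYTABLE';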

Thanks


Solution

  • This is rather a subjective question.

    If you are running both the Medium and the 2XL at the same time, why not run everything on the 2XL and drop the Medium? Whereas if you spin the 2XL up and down and it is up for less than 60 seconds, you still pay for the full 60 seconds; but a query that takes more than 10 minutes on the Medium, scaling in a perfectly linear fashion, will still take over a minute on the 2XL.
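
    As a back-of-the-envelope check, using the standard credit rates (Medium = 4 credits/hour, 2XL = 32 credits/hour, i.e. the 2XL has 8x the compute) and the perfectly linear scaling assumed above:

    10 minutes on the Medium:  4 * (10/60) = ~0.67 credits
    the same query on the 2XL: 10/8 = 1.25 minutes, so 32 * (1.25/60) = ~0.67 credits

    In other words, a query that needs over 10 minutes on the Medium comfortably clears the 60-second billing minimum on the 2XL, and with linear scaling it costs about the same credits while finishing roughly 8x sooner.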

    How you know whether a query can be split is partly theoretical: is it inherently parallel?

    select id, count(*) from mytable group by id;
    

    Where you have many distinct values of id this is very parallel, and even if you have only one id it is still parallelizable, because the per-slice counts do not conflict and can simply be added up. Whereas

    select id, count(distinct column2) ...
    

    needs to build the set of distinct column2 values per id, so not much is gained from 32 nodes working on it. But even so, the I/O and transformation work might be the costly part.
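
    A conceptual sketch of the difference (illustrative SQL only, not what Snowflake actually does internally):

    -- count(*) per id merges cleanly: each node can count its own slice of the
    -- data and the per-slice counts are simply added up at the end.
    select id, sum(cnt) as total
    from (
        select id, count(*) as cnt
        from mytable        -- imagine each node scanning only its slice here
        group by id
    ) t
    group by id;

    -- count(distinct column2) per id does not merge by addition: the distinct
    -- sets of column2 built on each node have to be combined per id before
    -- anything can be counted, which limits what 32 nodes buy you.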

    So it depends on the constraints of the query being run and the data it is working across, which means you should run your queries on different-sized warehouses and see whether they scale for your data loads.
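
    For example, a rough way to run that comparison (the warehouse names here are just placeholders):

    -- Benchmark sketch: run the same query on each size and compare elapsed time.
    alter session set use_cached_result = false;  -- keep the result cache from skewing the comparison

    use warehouse my_medium_wh;   -- placeholder name
    select sum(a) from mytable;

    use warehouse my_2xl_wh;      -- placeholder name
    select sum(a) from mytable;

    -- Then compare timings from the recent query history:
    select query_text, warehouse_name, total_elapsed_time / 1000 as seconds
    from table(information_schema.query_history())
    where query_text ilike 'select sum(a)%'
    order by start_time desc;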