Search code examples
snowflake-cloud-data-platformdbt

The dbt select * idiom and Snowflake


As described in the "How we structure our dbt project" guide, a commonly suggested dbt idiom / best practice is to begin by defining CTEs around the contributing tables, using select * statements. For example:

with 
orders as  (
    select * from {{ ref('stg_jaffle_shop__orders' )}}
),

order_payments as (
    select * from {{ ref('int_payments_pivoted_to_orders') }}
),

orders_and_order_payments_joined as (
    select
        . . . .
    from orders
    left join order_payments on orders.order_id = order_payments.order_id
)
select * from orders_and_payments_joined

This is routinely shown in dbt sample code, even when not all the columns from the contributing tables are ultimately used downstream. The rationale we've found described is that this makes for cleaner, more object-like code.

We are running on Snowflake. I have seen assertions in dbt groups that the select * idiom does not impact performance on that platform, due to under-the-hood query optimization etc., but I'd like to know for sure. We have had trouble benchmarking it ourselves due to caching. Do you know if this is the case, or should we avoid select * here and specify the columns we really want from the very beginning?


Solution

  • The select * idiom in CTEs can cause less than optimal plans when a CTE is referenced more than once in a query.

    The reason is that Snowflake will always "materialize" CTEs when referenced twice. This means that you won't get push-down of predicates into the CTE.

    In contrast to Snowflake, Postgres has the ability to control such Common Table Expression Materialization via it's NOT MATERIALIZED clause.

    A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work

    However, the other side of this coin is that the optimizer is not able to push restrictions from the parent query down into a multiply-referenced WITH query, since that might affect all uses of the WITH query's output when it should affect only one.

    You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query, or by specifying NOT MATERIALIZED to force it to be merged into the parent query. The latter choice risks duplicate computation of the WITH query, but it can still give a net savings if each usage of the WITH query needs only a small part of the WITH query's full output.

    So, until/unless Snowflake support doing e.g.:

    with 
    orders as  NOT MATERIALIZED (
        select * from {{ ref('stg_jaffle_shop__orders' )}}
    )
    ...
    

    (or, better, until the Snowflake optimizer automatically picks when to "materialize" CTEs or not), then if the rest of your query uses orders more than once and filters it, you should consider repeating the select * in your CTEs or manually moving the filter(s) into the CTE.

    P.S. Snowflake will always prune away unneeded columns in a CTE even if it is referenced more than once. The query plan currently gives the incorrect impression that column pruning does not happen, even though it does occur at execution time.