Search code examples
sqloptimizationsnowflake-cloud-data-platformsql-delete

Is it more efficient to delete rows or re-create a table on Snowflake


I have a table with 600M rows, and I want to delete a subset of those rows with ids coming from another table.

Will it be more efficient to delete the rows or re-create the table?

(Based on a discussion on the dbt Slack)


Solution

  • It depends. Sometimes DELETE is faster, sometimes CREATE OR REPLACE table is the best option.

    The mental model we have to use is that all the data in a table lives in multiple Snowflake "micro-partitions". If my DELETE is going to touch only one of those micro-partitions - either because I'm deleting only one row, or because all the rows I'm deleting are well clustered into a single micro-partition - then DELETE is faster.

    But if my DELETE is going to touch multiple micro-partitions of a table - I'd rather rebuild the whole table again.

    For example, let's setup a clone of TPC-H-SF100:

    create or replace table lineitem_100
    as
    select *
    from snowflake_sample_data.tpch_sf100.lineitem
    order by l_shipdate
    ;
    
    select count(*)
    from lineitem_100
    -- 600,037,902
    ;
    
    create table lineitem_100b
    clone lineitem_100;
    
    create table lineitem_100c
    clone lineitem_100;
    

    Deleting a random set of rows takes 25s:

    use warehouse fh_3xl
    ;
    delete from lineitem_100b
    where l_orderkey in (
        select o_orderkey
        from orders_100
        where o_totalprice between 50000 and 120000
    )
    -- 25s
    

    But re-creating the table without those rows takes half the time instead:

    insert overwrite into lineitem_100c
    select *
    from lineitem_100
    where l_orderkey not in (
        select o_orderkey
        from orders_100
        where o_totalprice between 50000 and 120000
    )
    order by l_shipdate
    -- 13s
    

    You can see that when re-creating the table I'm also taking care of clustering it well with the order by.

    It's also interesting to note that I did a insert overwrite instead of create or replace table - this way we get to keep all the policies and other metadata we could have applied to the table.

    Delete:

    enter image description here

    Recreate:

    enter image description here