Tags: apache-spark, pyspark, snowflake-cloud-data-platform

How to overwrite a single partition in Snowflake when using Spark connector


Is there a way for Spark to read a single date partition from a Snowflake table, update it, and then overwrite just that date partition? Concurrent writes should be supported. Spark currently has only two partition-overwrite modes (static and dynamic), and neither gives me a way to overwrite a single partition through the Snowflake connector. This is easy in Iceberg, where dynamic overwrite replaces only the partitions present in the new data. Am I missing some obvious solution?

I am interested in a solution that does not require transactions or complex SQL. Ideally it would behave like Iceberg's overwrite, i.e. a single operation that overwrites one partition.


Solution

  • Snowflake doesn’t expose its data as files (assuming that you are using standard tables), so your last comment, particularly wanting “Snowflake to overwrite its old files”, doesn’t really apply here.

    Assuming your Snowflake table has a column that indicates the “day”, you can delete that day’s records and then load your latest Spark output into Snowflake; see the sketch below. If your new set of files is instead a change to existing records in Snowflake, you can potentially insert/update/delete in one step with a MERGE statement.
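
    Here is a minimal PySpark sketch of that delete-then-append pattern, using the Snowflake Spark connector’s documented `preactions` option. The table and column names (`events`, `event_date`) and the connection options are hypothetical; also note that the delete and the load run as separate statements, so this is not one atomic transaction.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("overwrite-one-day").getOrCreate()

    # Hypothetical connection options -- substitute your own account details.
    sf_options = {
        "sfURL": "myaccount.snowflakecomputing.com",
        "sfUser": "my_user",
        "sfPassword": "my_password",
        "sfDatabase": "MY_DB",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "MY_WH",
    }

    target_day = "2024-01-15"  # the "partition" (day) to replace

    # Read only that day's records from Snowflake.
    df = (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("query", f"SELECT * FROM events WHERE event_date = '{target_day}'")
        .load()
    )

    updated_df = df  # ... apply your updates in Spark here ...

    # Delete the old day's rows, then append the new ones. `preactions`
    # is a connector option whose SQL runs just before the data transfer.
    (
        updated_df.write.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "events")
        .option("preactions",
                f"DELETE FROM events WHERE event_date = '{target_day}'")
        .mode("append")
        .save()
    )
    ```

    For concurrent writers you would want each job to target a different day; conflicting DML on the same table is queued by Snowflake rather than run in parallel.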

    If you are talking about some type of Snowflake external table (one that sits over files held in cloud storage), then you can manage those files outside of Snowflake and just refresh the table’s metadata in Snowflake as necessary, as sketched below.
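
    If you go the external-table route, the metadata refresh can be issued from the same Spark job via the connector’s `Utils.runQuery` helper (the external table name `my_ext_events` is hypothetical; `sf_options` is the same dict of connection options as above):

    ```python
    # Issue the refresh through the connector's JVM-side helper.
    sf_utils = spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils
    sf_utils.runQuery(sf_options, "ALTER EXTERNAL TABLE my_ext_events REFRESH")
    ```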

    If you’re just trying to move data from Spark to Snowflake, then I’d start by reading the Snowflake documentation for the Spark connector.