Tags: apache-spark-sql, apache-iceberg

Altering Iceberg table data using Spark SQL


I am creating Iceberg tables using Spark with a REST catalog and storing the data in MinIO. I want to update or delete rows based on the ID column. I could not find a direct method for this, so I am using the following workarounds:

UPDATE TABLE

-- Original Table (default.ns.tbl)
+----+------+------+------+
| ID | COL1 | COL2 | COL3 |
+----+------+------+------+
| 1  | one  | two  | three|
+----+------+------+------+
| 2  | one  | two  | three|
+----+------+------+------+
| 3  | one  | two  | three|
+----+------+------+------+

-- Row to update formed as another table (default.ns.temp)
+----+------+------+------+
| ID | COL1 | COL2 | COL3 |
+----+------+------+------+
| 1  | ten  | ten  | seven|
+----+------+------+------+

--Update command
INSERT INTO default.ns.tbl REPLACE WHERE ID = 1 SELECT * FROM default.ns.temp

--Drop the temporary table
DROP TABLE IF EXISTS default.ns.temp PURGE

DELETE COMMAND

INSERT OVERWRITE default.ns.tbl SELECT * FROM default.ns.tbl WHERE ID != 1

The issue with the UPDATE workaround is that it has to create and then drop a temporary table, so every update may be slow; the issue with the DELETE workaround is that earlier snapshots still retain the removed data, so the size of the MinIO bucket keeps growing.

How can I run UPDATE and DELETE commands in place?


Solution

  • The Spark SQL documentation does not cover this, but Iceberg's Spark integration supports row-level UPDATE and DELETE statements directly, just like regular SQL; no temporary table is needed.
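    As a sketch, assuming the Iceberg Spark runtime is on the classpath and the Iceberg SQL extensions are enabled (`spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`), the two workarounds above collapse into single statements (table and column names taken from the question; `catalog` and the timestamp are placeholders for your configured catalog name and retention cutoff):

    ```sql
    -- Row-level update: Iceberg rewrites only the affected data files
    UPDATE default.ns.tbl
    SET COL1 = 'ten', COL2 = 'ten', COL3 = 'seven'
    WHERE ID = 1;

    -- Row-level delete: no temporary table required
    DELETE FROM default.ns.tbl
    WHERE ID = 1;

    -- Earlier snapshots still reference the old data files; expire them
    -- periodically to reclaim space in the MinIO bucket
    CALL catalog.system.expire_snapshots(
      table => 'default.ns.tbl',
      older_than => TIMESTAMP '2024-01-01 00:00:00'
    );
    ```

    Note that the UPDATE and DELETE alone do not shrink the bucket, because Iceberg keeps the old files for time travel; the `expire_snapshots` procedure (part of Iceberg's stored procedures) is what actually removes data files no longer referenced by any retained snapshot.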