I am creating Iceberg tables from Spark with a REST catalog and storing the data in MinIO. I want to update or delete rows based on the ID column. I could not find direct statements for this, so I am using the following workarounds:
UPDATE TABLE
-- Original Table (default.ns.tbl)
+----+------+------+------+
| ID | COL1 | COL2 | COL3 |
+----+------+------+------+
| 1 | one | two | three|
+----+------+------+------+
| 2 | one | two | three|
+----+------+------+------+
| 3 | one | two | three|
+----+------+------+------+
-- Row to update formed as another table (default.ns.temp)
+----+------+------+------+
| ID | COL1 | COL2 | COL3 |
+----+------+------+------+
| 1 | ten | ten | seven|
+----+------+------+------+
-- Update command
INSERT INTO default.ns.tbl REPLACE WHERE ID = 1 SELECT * FROM default.ns.temp
-- Drop the temporary table
DROP TABLE IF EXISTS default.ns.temp PURGE
DELETE COMMAND
INSERT OVERWRITE default.ns.tbl SELECT * FROM default.ns.tbl WHERE ID != 1
The issue with the UPDATE workaround is that it has to create and then drop a temporary table, so every update might become slow; the issue with the DELETE workaround is that the data of earlier snapshots is still retained, which keeps increasing the size of the MinIO bucket.
How can I run UPDATE and DELETE commands in place?
The Spark SQL documentation is incomplete here: with the Iceberg integration there are straightforward UPDATE and DELETE statements, just like regular SQL.
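A minimal sketch of the row-level statements, reusing the catalog, namespace, and table names from the question, and assuming Spark 3.x with the Iceberg runtime on the classpath and the Iceberg SQL extensions enabled (spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions):
-- Update the row in place, no temporary table needed
UPDATE default.ns.tbl SET COL1 = 'ten', COL2 = 'ten', COL3 = 'seven' WHERE ID = 1
-- Delete the row in place
DELETE FROM default.ns.tbl WHERE ID = 1
-- If the new values already live in a source table, MERGE INTO handles upserts
MERGE INTO default.ns.tbl t USING default.ns.temp s ON t.ID = s.ID
WHEN MATCHED THEN UPDATE SET t.COL1 = s.COL1, t.COL2 = s.COL2, t.COL3 = s.COL3
WHEN NOT MATCHED THEN INSERT *
These statements still create new snapshots, so older data files remain in MinIO until snapshots are expired. A hedged sketch of reclaiming that space with Iceberg's expire_snapshots procedure; the cutoff timestamp and retain_last value are example settings to adjust for your own retention policy:
-- Expire snapshots older than the cutoff and drop data files no longer referenced
CALL default.system.expire_snapshots(table => 'ns.tbl', older_than => TIMESTAMP '2024-01-01 00:00:00.000', retain_last => 5)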