apache-spark, delta-lake, apache-iceberg

Is there a command to convert existing Parquet data to an Iceberg table in place?


Delta Lake can convert existing Parquet data to a Delta table by "simply" adding its own metadata, the _delta_log directory.

https://docs.delta.io/2.2.0/delta-utility.html#convert-a-parquet-table-to-a-delta-table

-- Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
CONVERT TO DELTA parquet.`<path-to-table>` PARTITIONED BY (part int, part2 int)

That is really convenient since it's a zero-copy operation (I believe my understanding is right, based on the source code here).

Does Iceberg have an equivalent feature?


Solution

  • The nearest equivalent to Delta Lake's convertToDelta method, described here, is Iceberg's migrate procedure. Iceberg also has an add_files procedure, which attempts to add files from a Hive or file-based table directly to a given Iceberg table. Use it with care; from the Iceberg docs:

    This procedure will not analyze the schema of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent expire_snapshot calls will be able to physically delete the added files.

    This could create inconsistencies if those files are owned by another metastore. This isn't an issue if you're planning a one-way migration off the original metastore. In that scenario it should be an efficient conversion.

    Edit: I should also mention Iceberg's snapshot procedure, which lets you test a migration in a lightweight way before converting.
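
For illustration, the three procedures mentioned above might be called from Spark SQL roughly as follows. This is a sketch, not a tested recipe: the catalog name `spark_catalog` and the table names `db.parquet_tbl`, `db.parquet_tbl_iceberg_test` and `db.iceberg_tbl` are hypothetical, it assumes the Iceberg Spark runtime and SQL extensions are configured for that catalog, and `<path-to-table>` is the same placeholder used in the question.

```sql
-- 1. Trial run: create an Iceberg table that references the source table's
--    data files without modifying the source (names are hypothetical).
CALL spark_catalog.system.snapshot('db.parquet_tbl', 'db.parquet_tbl_iceberg_test');

-- 2. In-place conversion: replace the existing catalog table with an Iceberg
--    table that reuses the same data files.
CALL spark_catalog.system.migrate('db.parquet_tbl');

-- 3. Path-based Parquet data: register the files in an Iceberg table that
--    already exists. The file schema is NOT checked against the table schema.
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_tbl',
  source_table => '`parquet`.`<path-to-table>`'
);
```

Note that snapshot and migrate take a table registered in the catalog as their source, so for a bare Parquet directory like the one in the question, add_files is likely the closer match to CONVERT TO DELTA.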