delta-lake

How to convert delta to parquet


I hope this is not a dumb question. I have a use case that requires converting a Delta table to plain Parquet.

The most relevant answer I found in online discussions is: (1) run VACUUM with a retention of 0 hours so that only the files of the latest version remain, and (2) delete the _delta_log directory, which holds the metadata and transaction log for the Delta format.
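Step (2) of that procedure can be sketched on a mock table directory. This is plain Python on invented file names, just to illustrate what removing the transaction log leaves behind; it is not a substitute for running VACUUM first.

```python
import pathlib
import shutil
import tempfile

# Build a toy directory that mimics a Delta table layout
# (the file names here are made up for illustration).
root = pathlib.Path(tempfile.mkdtemp())
(root / "_delta_log").mkdir()
(root / "_delta_log" / "00000000000000000000.json").touch()
(root / "part-0000.snappy.parquet").touch()

# Step (2): drop the transaction log so only parquet files remain.
shutil.rmtree(root / "_delta_log")

remaining = sorted(p.name for p in root.iterdir())
print(remaining)  # ['part-0000.snappy.parquet']
```

After this, the directory is just a set of .parquet files, which is why VACUUM must run first: any stale files that VACUUM did not remove would otherwise be picked up by a plain Parquet reader.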

Is that normally enough to convert a Delta table to Parquet?

I did some searching and reading online, and I still have the questions below.

For the Parquet format, we have multiple .parquet files that together represent the whole dataset.

For Delta, we have multiple "versions" of parquet files. Does each version represent the whole dataset? Or, more precisely, does each one capture a different state (snapshot) of the dataset? How exactly does VACUUM deal with these .parquet files?
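The version/snapshot intuition can be sketched with a toy model (this is not Delta's real implementation, just the bookkeeping idea): each commit records files added and removed, a version's snapshot is the cumulative set of live files, and VACUUM deletes files that no snapshot inside the retention window still references.

```python
# Toy model of Delta versioning: two commits, where the second
# rewrites part-0 into part-2 (all names are invented).
commits = [
    {"add": ["part-0.parquet", "part-1.parquet"], "remove": []},    # v0
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},      # v1
]

def snapshot(version):
    """Files that make up the dataset as of a given version."""
    live = set()
    for commit in commits[: version + 1]:
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return live

all_files = set().union(*(c["add"] for c in commits))
latest = snapshot(len(commits) - 1)

# With retention 0, only the latest snapshot is protected, so
# VACUUM would delete every file outside it.
vacuumed = all_files - latest
print(sorted(latest))    # ['part-1.parquet', 'part-2.parquet']
print(sorted(vacuumed))  # ['part-0.parquet']
```

So no single .parquet file is "the dataset": a snapshot is a set of files, newer versions mostly reuse the previous version's files, and VACUUM physically deletes only the files that no retained snapshot points to.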

Thanks


Solution

  • You can read the Delta table into a DataFrame and write it back out in Parquet format.

    This example uses PySpark (note that df.write.parquet produces a directory of part files, not a single file):

    # Read the Delta table and rewrite it as plain Parquet
    df = spark.read.format("delta").load("/tmp/delta-table")
    df.write.parquet("/tmp/parquet-table")