Hope this is not a dumb question. I have a use case that requires converting a Delta table to plain Parquet.
The most relevant answer I found in online discussions is: (1) call VACUUM
with a retention of 0 hours so only the latest version's data files are kept, then (2) delete the _delta_log
directory, which contains the metadata and transaction logs for the Delta format.
Is that normally enough to convert Delta to Parquet?
I did some online searching and learning, and I still have the questions below.
For the Parquet format, we have multiple .parquet
files, all of which together represent the whole dataset.
For Delta, we have multiple "versions" of Parquet files. Does each version represent the whole dataset? Or, more precisely, does each one contain a different state of the dataset (e.g., a snapshot)? How does VACUUM
deal with these .parquet
files in detail?
Thanks
You can read the Delta table into a DataFrame and write it back out in Parquet format.
This example uses PySpark:
# Read the Delta table, then rewrite its current snapshot as plain Parquet
df = spark.read.format("delta").load("/tmp/delta-table")
df.write.parquet("/tmp/parquet-table.parquet")
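To illustrate step (2) from the question at the filesystem level: once VACUUM has removed the data files of stale versions, deleting the _delta_log directory leaves an ordinary Parquet directory behind. Below is a minimal sketch using placeholder files (the paths and file names are made up for illustration; a real table holds actual Parquet data and JSON commit logs):

```python
import pathlib
import shutil
import tempfile

# Build a dummy Delta table layout: data files plus the _delta_log directory.
root = pathlib.Path(tempfile.mkdtemp()) / "delta-table"
(root / "_delta_log").mkdir(parents=True)
(root / "_delta_log" / "00000000000000000000.json").write_text("{}")
for name in ["part-00000.snappy.parquet", "part-00001.snappy.parquet"]:
    (root / name).write_text("")  # placeholder for real Parquet data

# After VACUUM has pruned old versions, removing _delta_log is what turns
# the directory into a plain Parquet dataset.
shutil.rmtree(root / "_delta_log")

remaining = sorted(p.name for p in root.iterdir())
print(remaining)  # only the .parquet data files are left
```

One caveat: Delta Lake refuses a VACUUM retention below the default 168 hours unless spark.databricks.delta.retentionDurationCheck.enabled is set to false, so the read-and-rewrite approach above is usually the safer route.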