dask, parquet, pyarrow

pyarrow/dask parquet dataset: how do I efficiently remove the last partition and modify _metadata correctly?


I have used dask, which internally uses pyarrow, to save a parquet dataset like the one below:

!ls {dataset_loc}
_common_metadata  part.11.parquet  part.16.parquet  part.5.parquet
_metadata     part.12.parquet  part.1.parquet   part.6.parquet
_metadatabackup   part.13.parquet  part.2.parquet   part.7.parquet
part.0.parquet    part.14.parquet  part.3.parquet   part.8.parquet
part.10.parquet   part.15.parquet  part.4.parquet   part.9.parquet

I want to remove part.16.parquet and change _metadata accordingly (so that I can append something new with dask again). I generally refer to this as 'top of dataset' management for efficient production use (in production, I only modify the top piece of the dataset).

I have tried many approaches without success. How can I remove part.16.parquet and update _metadata as if that file had never existed?

Furthermore, such operations will be performed many times, so re-reading the entire dataset defeats the purpose. Imagine that instead of part.16.parquet it is part.8237.parquet, and every parquet file is 100 MB.


Solution

  • Fastparquet has a method to do this kind of operation. The default action is to remove the data file at the same time, but you can supply the "remove" function yourself to make that a no-op, so that only the metadata is updated. It works something like the following, which removes the last "row group". Note that you typically have one row group per file (dask writes one per file), but not always. To find which row group you want to remove, you can print each one and look at the paths its columns point to (see the inspection sketch below).

    import fastparquet

    pf = fastparquet.ParquetFile(dataset_loc)  # the directory containing _metadata
    rg_to_remove = pf.row_groups[-1]           # last row group = last data file written
    # no-op "remove" keeps the data file on disk and only rewrites the metadata
    pf.remove_row_groups([rg_to_remove], remove_with=lambda *_: None)
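
As a quick way to do that inspection, a sketch like the one below should work. It assumes each row group's column chunks expose the standard file_path field from the parquet footer; adjust as needed for your fastparquet version.

    import fastparquet

    pf = fastparquet.ParquetFile(dataset_loc)
    # print the data file each row group points to, so you can pick the right one
    for i, rg in enumerate(pf.row_groups):
        print(i, rg.columns[0].file_path)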
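
Once the metadata has been rewritten, appending fresh data with dask might look like the following sketch. It assumes a dask version that still ships the fastparquet engine, and new_ddf is a hypothetical dask DataFrame holding the new partition(s).

    import dask.dataframe as dd

    # append new partitions; fastparquet extends the existing _metadata footer
    dd.to_parquet(new_ddf, dataset_loc, engine="fastparquet",
                  append=True, ignore_divisions=True)

Passing ignore_divisions=True avoids errors when the index divisions of the new data do not line up with the existing ones.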