I have used dask, which internally uses pyarrow, to save a parquet dataset like the one below:
!ls {dataset_loc}
_common_metadata part.11.parquet part.16.parquet part.5.parquet
_metadata part.12.parquet part.1.parquet part.6.parquet
_metadatabackup part.13.parquet part.2.parquet part.7.parquet
part.0.parquet part.14.parquet part.3.parquet part.8.parquet
part.10.parquet part.15.parquet part.4.parquet part.9.parquet
I want to remove part.16.parquet and change _metadata accordingly (so that I can append something new using dask again). I generally refer to this as 'top of dataset' management for efficient production usage (in production, I only modify the top piece of the dataset).
I have tried many methods without success. How can I remove part.16.parquet and change _metadata accordingly, as if the file had never existed?
Furthermore, such operations will be performed many times, so re-reading the entire dataset defeats my purpose. Think of it as removing part.8237.parquet instead of part.16.parquet, with every parquet file being 100MB.
Fastparquet has a method for this kind of operation. By default it removes the data file at the same time, but you can supply the "remove" function yourself to make that a no-op, so that only the metadata is updated. It works something like the following, which removes the last "row group". Note that you typically have one row group per file (dask does this), but not always. To find which row group you want to remove, you can print each one and look at the paths its columns point to.
import fastparquet

pf = fastparquet.ParquetFile(dir)
rg_to_remove = pf.row_groups[-1]
# the no-op remove_with keeps the data file on disk; only _metadata is rewritten
pf.remove_row_groups([rg_to_remove], remove_with=lambda *_: None)
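For your concrete case of dropping one specific file (part.16.parquet, or part.8237.parquet in a bigger dataset), a minimal sketch along these lines might work. It assumes dataset_loc is the directory containing _metadata, that dask wrote one row group per file with file_path set on each column chunk, and that your fastparquet version supports the remove_with argument; adjust as needed.

import os
import fastparquet

target = "part.16.parquet"  # the data file to drop, as it appears in _metadata
pf = fastparquet.ParquetFile(dataset_loc)

# find the row group(s) whose column chunks live in the target file
to_drop = [rg for rg in pf.row_groups
           if rg.columns[0].file_path
           and os.path.basename(rg.columns[0].file_path) == target]

# rewrite _metadata without those row groups, leaving all data files untouched
pf.remove_row_groups(to_drop, remove_with=lambda *_: None)

# optionally delete the now-orphaned data file yourself
os.remove(os.path.join(dataset_loc, target))

After this, the cost of each removal is proportional to the size of the metadata, not the 100MB data files, and a subsequent dask append (with the fastparquet engine) should pick up the trimmed _metadata.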