Is it possible to write to the same Parquet folder from different processes in Python? I use fastparquet.
It seems to work, but I'm wondering how it is possible for the _metadata file to not have conflicts when two processes write to it at the same time.
Also, to make it work I had to use ignore_divisions=True, which is not ideal for getting fast performance later when you read the Parquet file, right?
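
For context, each process does roughly the following (placeholder data; I'm appending through dask.dataframe with the fastparquet engine):

```python
import pandas as pd
import dask.dataframe as dd

# placeholder for the chunk this process produced
df = pd.DataFrame({"x": range(3)})

# append this process's partition to the shared folder
dd.from_pandas(df, npartitions=1).to_parquet(
    "out/",
    engine="fastparquet",
    append=True,             # the folder already holds earlier parts
    ignore_divisions=True,   # without this, to_parquet raises on append
)
```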
Dask consolidates the metadata from the separate processes, so that it writes the _metadata file only once the rest is complete, and this happens in a single thread.
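
If you let Dask drive the whole write in one call, you get this behaviour for free. A minimal sketch, assuming a recent Dask where write_metadata_file is an explicit option:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=4)

# the workers write the part files in parallel; the consolidated
# _metadata footer is then written once, by a single thread
ddf.to_parquet("out/", engine="fastparquet", write_metadata_file=True)
```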
If you were writing separate parquet files to a single folder using your own multiprocessing setup, each process would typically write its single data file and no _metadata at all. You could either gather the pieces the way Dask does, or consolidate the metadata from the data files after they are ready.
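
For the latter option, fastparquet can build the _metadata file from the footers of the finished data files. A sketch, assuming the workers wrote their part files into out/ (the naming pattern is hypothetical):

```python
import glob
import fastparquet

# collect the data files that the worker processes produced
parts = sorted(glob.glob("out/part.*.parquet"))

# read each file's own footer and write a combined _metadata
# (and _common_metadata) next to the data files
fastparquet.writer.merge(parts)
```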