When using dask.to_parquet(df, filename), a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file.
Can I use dask's to_parquet (without using compute() to create a pandas df) to just write a single file?
Writing to a single file is very hard in a parallel system. Sorry, Dask does not offer such an option (and probably neither does any other parallel processing library).
You could in theory do this yourself, with a non-trivial amount of work: iterate through the partitions of your dataframe, write each one to the target file (which you keep open), and accumulate the output row-groups into the final metadata footer of the file. I would know how to go about this with fastparquet, but that library is not being developed much any more.