python, pandas, dask, dask-distributed

Dask to_parquet throws exception "No such file or directory"


The following Dask code attempts to store a dataframe as Parquet, read it back, add a column, and store the dataframe again with the new column.

This is the code:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'height': [6.21, 5.12, 5.85],
    'weight': [150, 126, 133]
})

ddf = dd.from_pandas(pdf, npartitions=3) 
ddf.to_parquet('C:\\temp\\test3', engine='pyarrow', overwrite=True)
ddf2 = dd.read_parquet('C:\\temp\\test3') 
ddf2['new_column'] = 1
ddf2.to_parquet('C:\\temp\\test3', engine='pyarrow', overwrite=True) # <- this one fails

The error I get is:

FileNotFoundError: [Errno 2] No such file or directory: 'C:/temp/test3/part.0.parquet'

If I check, the C:\temp\test3 directory is empty.

I think that when the second to_parquet is executed, overwrite=True triggers an implicit compute(), so the work starts at the (still lazy) read_parquet; but since the overwrite has already deleted the files, the read can't find them. Is that the case?
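
A quick check seems to confirm this (a minimal sketch using a throwaway directory; the lazy_check path is just an example): deleting the files after read_parquet and only then triggering computation produces the same error.

import shutil

import pandas as pd
import dask.dataframe as dd

path = 'C:\\temp\\lazy_check'  # throwaway directory, name is arbitrary

# Write a tiny parquet dataset just to have something to read back.
dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=1).to_parquet(path, engine='pyarrow')

# read_parquet returns immediately; the row data has not been loaded yet.
lazy = dd.read_parquet(path)

# Remove the files while the dataframe is still unevaluated...
shutil.rmtree(path)

# ...so the read tasks only fail here, when computation is triggered.
lazy.compute()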

In any case, how can I make this work? Note that in the real scenario the dataframe doesn't fit in memory.

UPDATE

I'm not trying to update the parquet file in place; I need to write it again, overwriting the existing one.


Solution

  • This works: write to_parquet to a different directory and delete the old parquet directory afterwards. Because Dask is lazy, read_parquet only loads the data when the second to_parquet runs, and overwrite=True deletes the source files before that read happens; writing to a new path keeps the source intact until the write has finished:

    import os
    import shutil

    ddf = dd.from_pandas(pdf, npartitions=3)
    ddf.to_parquet('C:\\temp\\OLD_FILE_NAME', engine='pyarrow', overwrite=True)
    ddf2 = dd.read_parquet('C:\\temp\\OLD_FILE_NAME')
    ddf2['new_column'] = 1
    # Write to a different directory, so the source files still exist
    # when the lazy read actually runs.
    ddf2.to_parquet('C:\\temp\\NEW_FILE_NAME', engine='pyarrow', overwrite=True)

    # The write has finished, so the old parquet directory can be removed.
    path_to_delete = os.path.dirname('C:\\temp\\OLD_FILE_NAME\\')
    shutil.rmtree(path_to_delete)
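
  • If the new data has to end up under the original path, a variation on the same idea (just a sketch; the _tmp suffix is an arbitrary example name) is to write to a temporary sibling directory and, only once the write has finished, swap it into place:

    import os
    import shutil

    import dask.dataframe as dd

    src = 'C:\\temp\\test3'      # existing parquet dataset
    tmp = 'C:\\temp\\test3_tmp'  # temporary target, arbitrary name

    # Lazy read, add the column, and write to a *different* directory,
    # so the source files still exist when the read tasks run.
    ddf = dd.read_parquet(src)
    ddf['new_column'] = 1
    ddf.to_parquet(tmp, engine='pyarrow')

    # The write has completed, so the old dataset can be replaced.
    shutil.rmtree(src)
    os.rename(tmp, src)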