python-3.x pandas amazon-s3 parquet

pandas to_parquet to an S3 URL leaves a trail of empty local directories mirroring the S3 URL


Below is the code that I ran:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5))
df.columns = ['a', 'b', 'c', 'd', 'e']
df['p'] = 2
df.to_parquet('s3://my_bucket/test01/boo.parquet', engine='fastparquet', compression='gzip', partition_cols=['p'])

The Parquet data is saved to S3 as expected. But my working directory now contains a directory called "s3:", which holds the full directory structure interpreted from the S3 URL.


Solution

  • OK, I realize that this is a fastparquet quirk.

    This only happens when partition_cols is provided and engine='fastparquet'. Without partition_cols, or with the default engine (engine='auto', which tries pyarrow first), the empty directory artifact does not appear. It looks like a quirk in how fastparquet handles partitioned writes.
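If you have to stay on fastparquet with partition_cols, one possible way to tidy up afterwards is to delete the stray local tree once the write has finished. The sketch below is only an illustration, not something pandas or fastparquet provides: remove_empty_tree is a hypothetical helper name, and it assumes the artifact really contains nothing but empty directories, as described in the question. It refuses to delete anything if it finds a file or a symlink.

```python
import os
from pathlib import Path


def remove_empty_tree(root):
    """Delete `root` only if it holds nothing but empty directories.

    Hypothetical helper for illustration. Returns True if the tree was
    removed, False if it is missing or contains any file or symlink.
    """
    root = Path(root)
    if root.is_symlink() or not root.is_dir():
        return False

    # First pass: refuse to touch a tree that holds files or symlinks.
    for dirpath, dirnames, filenames in os.walk(root):
        if filenames:
            return False
        if any(os.path.islink(os.path.join(dirpath, d)) for d in dirnames):
            return False

    # Second pass: walk bottom-up so children are removed before parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        os.rmdir(dirpath)
    return True
```

Calling remove_empty_tree("s3:") from the working directory after df.to_parquet(...) returns would remove the leftover tree, and does nothing if the directory is absent or unexpectedly contains data. Switching to engine='pyarrow' avoids the artifact in the first place, so this is only worth doing when fastparquet is a hard requirement.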