We are appending data to an existing partitioned Parquet dataset stored in S3 using pyarrow. This runs on AWS Lambda several times per hour. A minimal example would be:
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()  # filesystem handle passed to write_to_dataset

df = ...  # existing pandas DataFrame
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    filesystem=s3,
    root_path=f"s3://s3-path/",
    partition_cols=["year", "month"],
)
As a result, a number of Parquet files are written to S3, with the exact set depending on the data values. Our aim is to track which files have been written by capturing their resulting filenames (S3 keys).

Is there any way to capture the actual filenames written by pyarrow or s3fs? The Parquet files are given arbitrary, hash-like names, and I do not see any logging functionality in either of the two packages mentioned.
Starting with pyarrow 0.15.0, you can provide the filenames yourself via the partition_filename_cb argument before writing:
pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=None, partition_filename_cb=None, filesystem=None, **kwargs)
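Since you control the returned basename, you can also record it as you go and rebuild the full S3 key from the hive-style year=.../month=... directories that write_to_dataset creates. A minimal sketch, assuming the callback receives the tuple of partition values for each group; the name_partition helper and written_keys list are my own illustration, not part of the API:

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
table = pa.Table.from_pandas(df)  # df as in the question

root_path = "s3://s3-path/"
partition_cols = ["year", "month"]
written_keys = []

def name_partition(keys):
    # keys holds the partition values for one group, e.g. (2021, 7)
    filename = "-".join(str(k) for k in keys) + ".parquet"
    # write_to_dataset lays files out as root_path/year=YYYY/month=MM/<filename>
    subdir = "/".join(f"{col}={val}" for col, val in zip(partition_cols, keys))
    written_keys.append(f"{root_path}{subdir}/{filename}")
    return filename

pq.write_to_dataset(
    table,
    filesystem=s3,
    root_path=root_path,
    partition_cols=partition_cols,
    partition_filename_cb=name_partition,
)

print(written_keys)  # full S3 keys of every file written in this call

Note that, unlike the default UUID-based names, a deterministic name will overwrite an existing file in the same partition, so include something unique (e.g. a timestamp) in the returned filename if each Lambda invocation must add a new file rather than replace the previous one.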