parquet pyarrow

pyarrow dataset: partitioning by filename and converting the filename to a field/column


Is there a way to use the filename in a dataset and have it appear as a column?

For example, if the directory contains

file1.parquet file2.parquet file3.parquet

can I load that directory as a dataset and get a column with the values file1, file2, and file3?

Or does partitioning only work with directory names? That seems to be the case; is that right?


Solution

  • Support for filename-based partitioning is expected in Arrow 8.0.0, which will likely be released later this month or in May 2022; see ARROW-14612. The ability to expose the filename as a column is planned for the same release; see ARROW-15281.