PyArrow dataset partitioning by filename: converting the filename to a field/column

Is there a way to use the filename in a dataset as a column?

For example, if the directory contains

file1.parquet file2.parquet file3.parquet

can loading it as a dataset produce a column with the values file1, file2, and file3?

Or does partitioning only work with directory names? That seems to be the case; is that right?

Solution

Support for filename-based partitioning will be in Arrow 8.0.0, which will likely be released later this month or in May 2022; see ARROW-14612. The same goes for being able to have a column containing the filename; see ARROW-15281.
