I am reading a parquet file with 2 partitions using Spark in order to apply some processing. Let's take this example:
├── Users_data
│   ├── region=eu
│   │   ├── country=france
│   │   │   ├── fr_default_players_results.parquet
│   ├── region=na
│   │   ├── country=us
│   │   │   ├── us_default_players_results.parquet
Is there a way to preserve the same file names (in this case fr_default_players_results.parquet and us_default_players_results.parquet) when writing the parquet back with df.write()?
No, unfortunately you cannot choose file names with Spark because they are generated automatically. What you can do, however, is create a column that contains the file name and partition by that column as well. This will create a directory named after each file, and inside it the files generated by Spark:
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
// in PySpark: from pyspark.sql.functions import input_file_name, regexp_extract

df
  // keep the original file name (everything after the last "/") in a column
  .withColumn("file_name", regexp_extract(input_file_name(), "[^/]*$", 0))
  .write
  .partitionBy("region", "country", "file_name")
  .parquet("path/Users_data")
This will create this tree:
├── Users_data
│   ├── region=eu
│   │   ├── country=france
│   │   │   ├── file_name=fr_default_players_results.parquet
│   │   │   │   ├── part-00...c000.snappy.parquet
│   ├── region=na
│   │   ├── country=us
│   │   │   ├── file_name=us_default_players_results.parquet
│   │   │   │   ├── part-00...c000.snappy.parquet
If you want to go further and really change the names, you can use the Hadoop FileSystem API to loop over the files, move each one up to its parent path, rename it using the name of the folder Spark generated (file_name=....parquet), and then delete the now-empty folders, as in the sketch below.
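Here is a minimal sketch of that last step in Scala. It assumes the data was written to path/Users_data as above and that each file_name=... folder holds a single part file (e.g. after a coalesce(1) per partition); the variable names and the spark session are illustrative, not part of any fixed API beyond Hadoop's FileSystem.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val base = new Path("path/Users_data")

// every file_name=... folder created by partitionBy
val dirs = Option(fs.globStatus(new Path(base, "region=*/country=*/file_name=*")))
  .getOrElse(Array.empty)

dirs.foreach { dirStatus =>
  val dir = dirStatus.getPath                              // .../file_name=fr_default_players_results.parquet
  val targetName = dir.getName.stripPrefix("file_name=")   // fr_default_players_results.parquet

  // the part file Spark generated inside that folder
  fs.listStatus(dir).map(_.getPath).filter(_.getName.startsWith("part-")).headOption.foreach { part =>
    // move it to the parent (country=...) directory under the original name
    fs.rename(part, new Path(dir.getParent, targetName))
  }

  // remove the now-empty file_name=... folder
  fs.delete(dir, true)
}

On HDFS the rename is a cheap metadata move; on object stores like S3 it turns into a copy and delete under the hood, so expect it to take longer there.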