I am reading a parquet file with 2 partitions using Spark in order to apply some processing. Let's take this example:
├── Users_data
│   ├── region=eu
│   │   ├── country=france
│   │   │   ├── fr_default_players_results.parquet
│   ├── region=na
│   │   ├── country=us
│   │   │   ├── us_default_players_results.parquet
Is there a way to preserve the same file names (in this case fr_default_players_results.parquet and us_default_players_results.parquet) when writing the parquet back with df.write()?
No, unfortunately you cannot choose file names with Spark because they are generated automatically. What you can do, however, is create a column that contains the file name and partition by that column as well. This will create a directory named after each file, and inside it the files generated by Spark:
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
// in PySpark: from pyspark.sql.functions import input_file_name, regexp_extract

df
  // keep the original file name (everything after the last "/") in a column
  .withColumn("file_name", regexp_extract(input_file_name(), "[^/]*$", 0))
  .write
  .partitionBy("region", "country", "file_name")
  .parquet("path/Users_data")
This will create this tree:
├── Users_data
│   ├── region=eu
│   │   ├── country=france
│   │   │   ├── file_name=fr_default_players_results.parquet
│   │   │   │   ├── part-00...c000.snappy.parquet
│   ├── region=na
│   │   ├── country=us
│   │   │   ├── file_name=us_default_players_results.parquet
│   │   │   │   ├── part-00...c000.snappy.parquet
If you want to go further and really change the names, you can use the Hadoop FileSystem API to loop over the files, move each one up to its parent path, rename it using the name of the folder Spark generated (file_name=....parquet), and then delete the now-empty folders, as in the sketch below.
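Here is a minimal sketch of that last step in Scala. It assumes the data was written to path/Users_data as above and that each file_name=... folder holds a single part file (e.g. after a coalesce(1) per partition); the variable names and the spark session are illustrative, not part of any fixed API beyond Hadoop's FileSystem.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val base = new Path("path/Users_data")

// every file_name=... folder created by partitionBy
val dirs = Option(fs.globStatus(new Path(base, "region=*/country=*/file_name=*")))
  .getOrElse(Array.empty)

dirs.foreach { dirStatus =>
  val dir = dirStatus.getPath                              // .../file_name=fr_default_players_results.parquet
  val targetName = dir.getName.stripPrefix("file_name=")   // fr_default_players_results.parquet

  // the part file Spark generated inside that folder
  fs.listStatus(dir).map(_.getPath).filter(_.getName.startsWith("part-")).headOption.foreach { part =>
    // move it to the parent (country=...) directory under the original name
    fs.rename(part, new Path(dir.getParent, targetName))
  }

  // remove the now-empty file_name=... folder
  fs.delete(dir, true)
}

On HDFS the rename is a cheap metadata move; on object stores like S3 it turns into a copy and delete under the hood, so expect it to take longer there.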