apache-spark, pyspark, parquet

How to get Parquet row group stats sorted across multiple files with PySpark?


My process produces multiple Parquet files as output, in my specific case 64, which is the number of final partitions. Each file is around ~400MB, split into 3 row groups of ~128MB each. Looking at a single file, every row group has its own min/max stats, which are well distributed per column. For example, given a long column, one file (part-00000-*) has:

- group 1 - min: 102, max: 9992
- group 2 - min: 9994, max: 21990
- group 3 - min: 22098, max: 35764

and the same column in the next file (part-00001-*) has a similar distribution:

- group 1 - min: 99, max: 9882
- group 2 - min: 9980, max: 21979
- group 3 - min: 22018, max: 32764

for a total of 2 files. What I need instead is a distribution like:

part-00000-*:
- group 1 - min: 99, max: 8662
- group 2 - min: 8994, max: 13986
- group 3 - min: 14333, max: 19845

part-00001-*:
- group 1 - min: 19877, max: 25621
- group 2 - min: 25654, max: 30091
- group 3 - min: 31094, max: 35764

How can I get that when writing the files in PySpark? I know I can simply sort with df.orderBy('longColumn'), but that affects performance too much, so I am looking for an alternative.
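
For reference, per-row-group min/max statistics like the ones quoted above can be inspected with PyArrow. A minimal sketch, assuming a local part file and the column name used above (both are placeholders):

    import pyarrow.parquet as pq

    # Print the min/max of one column for every row group in a single file.
    pf = pq.ParquetFile("part-00000-example.snappy.parquet")
    meta = pf.metadata
    for rg_idx in range(meta.num_row_groups):
        rg = meta.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            if col.path_in_schema == "longColumn":
                stats = col.statistics
                print(f"group {rg_idx + 1} - min: {stats.min} max: {stats.max}")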


Solution

  • Assuming a more or less uniform distribution of values in your longColumn, perhaps a repartitionByRange followed by a sortWithinPartitions would be a better strategy:

    # Range-partition by longColumn into 64 partitions, then sort each
    # partition locally, so every output file (and its row groups) covers
    # a narrow, non-overlapping range of longColumn.
    (df
      .repartitionByRange(64, "longColumn")
      .sortWithinPartitions("longColumn", ascending=True)
      .write
      .parquet("myFile.parquet"))

    This way, you'll avoid collecting and sorting the whole dataframe on one executor.
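
    If you want to check the outcome, the same kind of metadata inspection can be run over the output directory: each part file's overall min/max range for longColumn should now be disjoint from the others. A sketch, assuming the write above produced local files under myFile.parquet/ (path and glob pattern are placeholders):

        import glob

        import pyarrow.parquet as pq

        # For every output part file, fold the row-group stats of longColumn
        # into a single overall [min, max] range and print it.
        for path in sorted(glob.glob("myFile.parquet/part-*.parquet")):
            meta = pq.ParquetFile(path).metadata
            lo, hi = None, None
            for rg_idx in range(meta.num_row_groups):
                rg = meta.row_group(rg_idx)
                for col_idx in range(rg.num_columns):
                    col = rg.column(col_idx)
                    if col.path_in_schema == "longColumn":
                        s = col.statistics
                        lo = s.min if lo is None else min(lo, s.min)
                        hi = s.max if hi is None else max(hi, s.max)
            print(path, lo, hi)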