My process produces multiple Parquet files as output, in my case 64, matching the number of final partitions. Each file is around ~400MB, split into 3 row groups of ~128MB each. Looking at a single file, every row group has its own min/max stats, which are well distributed per column. For example, a long column has, for one file (part-00000-*):
- group 1 - min:102 max: 9992
- group 2 - min:9994 max: 21990
- group 3 - min:22098 max: 35764
and the same column in the next file (part-00001-*) has a similar distribution:
- group 1 - min:99 max: 9882
- group 2 - min:9980 max: 21979
- group 3 - min:22018 max: 32764
for a total of 2 files. What I need instead is a distribution like:
part-00000-*:
- group 1 - min:99 max: 8662
- group 2 - min:8994 max: 13986
- group 3 - min:14333 max: 19845
part-00001-*:
- group 1 - min:19877 max: 25621
- group 2 - min:25654 max: 30091
- group 3 - min:31094 max: 35764
How can I achieve that when writing the files in PySpark? I know I could simply sort with df.orderBy('longColumn'), but that hurts performance too much, so I am looking for an alternative.
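For reference, this is roughly the global-sort baseline I am trying to avoid (the output path here is just a placeholder):

# full global sort of the DataFrame before writing - correct but expensive
df.orderBy("longColumn").write.parquet("output_path")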
Assuming a more or less uniform distribution of values within your longColumn, perhaps a repartitionByRange followed by sortWithinPartitions would be a better strategy:
(df
    # range-partition into 64 partitions so each output file covers a disjoint value range
    .repartitionByRange(64, "longColumn")
    # sort locally inside each partition so the row groups within a file don't overlap
    .sortWithinPartitions("longColumn", ascending=True)
    .write.parquet("myFile.parquet"))
This way you avoid a full global sort of the DataFrame: rows are only range-partitioned across the 64 output files and then sorted locally within each partition.
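If you want to verify the resulting layout, you can inspect the row-group statistics of the written part files, for example with pyarrow (assuming it is installed and the files were written to myFile.parquet):

import glob
import pyarrow.parquet as pq

# Print min/max of "longColumn" for every row group in every part file.
for path in sorted(glob.glob("myFile.parquet/part-*")):
    md = pq.ParquetFile(path).metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            if chunk.path_in_schema == "longColumn":
                stats = chunk.statistics
                print(path, "group", rg, "min:", stats.min, "max:", stats.max)

With the range-partitioned write above, the printed min/max ranges should no longer overlap across files or row groups.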