palantir-foundry foundry-phonograph

Should I be worried about parquet files being 48MB?


I set a transform to use 2000 shuffle partitions and found that the output went from 200 files (about 442MB each) to 2000 files (about 48MB each). Is this something to be worried about?


Solution

  • Short answer: No, this is probably fine and likely won't cause issues.

    Reducing the number of files (and so increasing their size), however, is a fairly cheap operation, which you can achieve by calling .coalesce(200) at the end of the transform. This merges existing partitions, and therefore output files, without a full shuffle. Depending on how uniformly your data is distributed, file sizes may vary somewhat. If that ever becomes an issue, you can use .repartition(200) instead (this requires a shuffle, which increases the compute cost of your job).
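
As a rough sketch of the two options above, the helpers below pick a file count from a target file size and then apply either .coalesce or .repartition. The names target_file_count and compact, and the 480MB target, are hypothetical and only for illustration; the code assumes df is a PySpark DataFrame and that each output partition is written as one file, which is the usual behaviour but worth checking for your dataset.

```python
import math


def target_file_count(total_size_mb: float, target_file_mb: float) -> int:
    """Number of output files needed so each is roughly target_file_mb.

    Hypothetical helper: sizes are whatever you observe on the dataset.
    """
    return max(1, math.ceil(total_size_mb / target_file_mb))


def compact(df, num_files: int, even_sizes: bool = False):
    """Reduce the number of partitions of df before it is written.

    Assumes df is a pyspark.sql.DataFrame; only its coalesce() and
    repartition() methods are used here.
    """
    if even_sizes:
        # Full shuffle: costs more compute, but partitions come out
        # roughly equal in size.
        return df.repartition(num_files)
    # No full shuffle: existing partitions are merged, so sizes can be
    # uneven if the data is not uniformly distributed.
    return df.coalesce(num_files)
```

With the numbers from the question, target_file_count(2000 * 48, 480) gives 200, so compact(df, 200) as the last step of the transform would bring the output back to about 200 files. Passing even_sizes=True swaps in the shuffle-based variant if uneven file sizes ever become a problem.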