apache-spark pyspark google-cloud-dataflow apache-beam apache-beam-internals

Equivalent of repartition in Apache Beam


In Spark, if we need to reshuffle the data, we can call repartition on a DataFrame. What is the equivalent in Apache Beam for a PCollection?

In PySpark:

new_df = df.repartition(4)

Solution

  • From this doc:

    You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.

    That said, I'm not sure whether Reshuffle is, or will continue to be, supported by other Beam runners.

    For the Javadoc and a further explanation of Reshuffle, see: Apache Beam/Dataflow Reshuffle