apache-spark pyspark google-cloud-dataflow apache-beam apache-beam-internals

Equivalent of repartition in apache beam

In spark, if we have to reshuffle the data, we can use repartition of a dataframe. What's the way to do the same in apache beam for a pcollection?

In pyspark,

new_df = df.repartition(4)

Solution

From this doc:

You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.

Though I'm not sure if Reshuffle is and still will be supported by other runners of Beam.

The java doc and further explanation of Reshuffle: Apache Beam/Dataflow Reshuffle