google-cloud-dataflow apache-flink apache-beam

Does Flatten have any effects other than flattening collections element-wise?

Specifically, does the Flatten PTransform in Beam perform any sort of:

Deduplication
Filtering
Purging of existing elements

Or does it just "merge" two different PCollections?

Solution

The Flatten transform does not do any sort of deduplication, or filtering of any kind. As mentioned, it simply merges the multiple PCollections into one that contains the elements of each of the inputs.

This means that:

with beam.Pipeline() as p:
    c1 = p | "Branch1" >> beam.Create([1, 2, 3, 4])
    c2 = p | "Branch2" >> beam.Create([4, 4, 5, 6])

    result = (c1, c2) | beam.Flatten()

In this case, the result PCollection contains the following elements: [1, 2, 3, 4, 4, 4, 5, 6].

Note how the element 4 appears once in c1, and twice in c2. This is not deduplicated, filtered or removed in any way.

As a curious fact about Flatten, some runners optimize it away, and simply add the downstream transform in both branches. So, in short, no special filtering or dedups. Simply merging of PCollections.