Specifically, does the Flatten PTransform in Beam perform any sort of:
Or does it just "merge" two different PCollections?
The Flatten transform does not do any deduplication or filtering of any kind. As mentioned, it simply merges multiple PCollections into one that contains every element of each input.
This means that:
import apache_beam as beam

with beam.Pipeline() as p:
    c1 = p | "Branch1" >> beam.Create([1, 2, 3, 4])
    c2 = p | "Branch2" >> beam.Create([4, 4, 5, 6])
    # Merge both branches into a single PCollection.
    result = (c1, c2) | beam.Flatten()
In this case, the result PCollection contains the following elements: [1, 2, 3, 4, 4, 4, 5, 6] (in no guaranteed order, since a PCollection is unordered).
Note how the element 4 appears once in c1 and twice in c2, so it appears three times in the result. It is not deduplicated, filtered or removed in any way.
As a curious fact about Flatten, some runners optimize it away and simply attach the downstream transform to each input branch instead. So, in short: no special filtering or deduplication, just a merge of PCollections.