Both DoFn and PTransform are means of defining operations on a PCollection. How do we know which to use when?
A simple way to understand it is by analogy with map(f) for lists:

map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.

f is the logic applied to each element.

Now, switching to Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.
PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.

ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about them for this question.

DoFn, which I called fn here, is the logic that is applied to each element.

It may also help to remember that you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo to apply your logic.
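
To make the map(f) analogy concrete, here is a minimal sketch in plain Java. It deliberately does not use Beam: the class MapAnalogy, its map helper, and the sample words are hypothetical names chosen for illustration. The map method stands in for the pattern (the role ParDo plays), and the function f stands in for the per-element logic (the role a DoFn plays).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapAnalogy {

    // The "pattern": apply some function to every element and collect the
    // results into a new list. In the analogy, this is the part ParDo plays.
    // (Hypothetical helper for illustration; not part of Beam.)
    public static <A, B> List<B> map(Function<A, B> f, List<A> input) {
        return input.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The "logic": what to do with a single element. In the analogy,
        // this is the part a DoFn plays.
        Function<String, Integer> f = String::length;

        // Combining pattern and logic, loosely like ParDo.of(fn).
        List<Integer> lengths = map(f, Arrays.asList("beam", "pardo", "dofn"));
        System.out.println(lengths); // prints [4, 5, 4]
    }
}
```

The point of the split is that f knows nothing about lists or iteration, and map knows nothing about string lengths. In the same way, a DoFn only describes what happens to one element, while how that logic is applied across a whole PCollection is left to ParDo and the runner.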