Search code examples
pythonpython-3.xgoogle-cloud-dataflowapache-beam

Apache Beam Map, DoFn and Composite Transform


I want to understand the difference is use cases between a Map function, a DoFn called from Pardo and a Composite transform.

I could achieve the same results with the below code for a list of transformations that I need to do for my pipeline. I made a sample of what I mean by multiple stages.

import apache_beam as beam

def myTransform(line):
   line = line * 10
   line = line + 5
   line = line - 2
   return line

class myPTransform(beam.PTransform):
  def expand(self, pcoll):
    # return pcoll | beam.Map(myTransform)
    pcol_output = (pcoll 
                   | beam.Map(lambda line: line * 10)
                   | beam.Map(lambda line: line + 5)
                   | beam.Map(lambda line: line - 2)
    )
    return pcol_output

class mydofunc(beam.DoFn):
   def process(self, element):
    element = element * 10
    element = element + 5
    element = element - 2
    yield element
   
with beam.Pipeline() as p:
    lines = p | beam.Create([1,2,3,4,5])
    
    ### Map Function
    manual = (lines
              | "Map function" >> beam.Map(myTransform)
              | "Print map" >> beam.Map(print))
    
    ### Composite Ptransform
    ptrans = (lines
              | "ptransform call" >> myPTransform()
              | "Print ptransform" >> beam.Map(print))
    
    ### Do Function
    dofnpcol = (lines
              | "Dofn call" >> beam.ParDo(mydofunc())
              | "Print dofnpcol" >> beam.Map(print))
    
    

On what scenarios should I use a DoFn and a Composite Transform? I might be missing a bigger picture here for the difference between these 3 options. Any insights would be really helpful.

I saw a question on Apache Beam: DoFn vs PTransform


Solution

  • For this operation, the transformation is done only inside a worker :

       ### Map Function
        manual = (lines
                  | "Map function" >> beam.Map(myTransform)
                  | "Print map" >> beam.Map(print))
    

    For this second operation, the 3 transformations can be done on multiple workers :

        ### Composite Ptransform
        ptrans = (lines
                  | "ptransform call" >> myPTransform()
                  | "Print ptransform" >> beam.Map(print))
    

    In this last operation, the result is exactly the same as the first example with Map. Behind the Map operator uses a DoFn, but with the possibility to directly pass a function (or lambda) as parameter. It's interesting to remove boilerplate code and directly call a transformation from an element to another element :

        
        ### Do Function
        dofnpcol = (lines
                  | "Dofn call" >> beam.ParDo(mydofunc())
                  | "Print dofnpcol" >> beam.Map(print))
    

    The choice depends on different criteria, in your example the transformation isn't expensive and you can use the first and third example.

    But in real-world application, big volume and more complexe operations, you can choose the second approach.

    The composite PTransform allows to assemble several transformations together and this kind of class and design allow a reusability in other places of the pipeline or in other pipelines.