Tags: apache-beam, dataflow, pprof

CPU profiling not covering all the vCPU time of Apache Beam pipeline on Dataflow


Our pipeline is built with the Apache Beam Go SDK. I'm trying to profile the CPU usage of all workers by setting the flag --cpu_profiling=gs://gs_location (see https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/dataflow/dataflow.go).
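
For context, here is a minimal sketch of how such a pipeline might be launched with profiling enabled. The project, region, and bucket names are placeholders, and the transform body merely stands in for the real pipeline:

```go
// Minimal sketch of a Beam Go pipeline launched on Dataflow with
// --cpu_profiling set. All values are placeholders; launch with e.g.:
//
//   go run main.go --runner=dataflow --project=<project> --region=<region> \
//     --staging_location=gs://<bucket>/staging \
//     --cpu_profiling=gs://<bucket>/profiles
package main

import (
	"context"
	"flag"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()
	beam.Create(s, "hello", "world") // stands in for the real pipeline

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatal(err)
	}
}
```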

The job finished after spending 16.636 vCPU hr with a maximum of 104 workers.

As a result, a set of files named "profprocess_bundle-*" is written to the specified GCS location.

I then downloaded these files, unzipped them, and visualized the results with pprof (https://github.com/google/pprof).
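
To put numbers on the comparison below, a small Go program can merge the downloaded profiles and sum their profiling windows using the github.com/google/pprof/profile package. This is a sketch under assumptions: the files are assumed to sit in a local profiles/ directory, and the summed DurationNanos is the length of each profiling window, not the exact sampled CPU time pprof reports:

```go
// Sketch: merge downloaded profprocess_bundle-* files and sum their
// profiling windows. Assumes the files were downloaded into ./profiles.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/google/pprof/profile"
)

func main() {
	paths, err := filepath.Glob("profiles/profprocess_bundle-*")
	if err != nil || len(paths) == 0 {
		log.Fatalf("no profiles found: %v", err)
	}

	var profs []*profile.Profile
	var window time.Duration
	for _, path := range paths {
		f, err := os.Open(path)
		if err != nil {
			log.Fatal(err)
		}
		p, err := profile.Parse(f) // Parse also accepts gzip-compressed profiles
		f.Close()
		if err != nil {
			log.Fatalf("parsing %s: %v", path, err)
		}
		window += time.Duration(p.DurationNanos)
		profs = append(profs, p)
	}

	merged, err := profile.Merge(profs)
	if err != nil {
		log.Fatal(err)
	}

	// Save the merged profile; open it with `pprof -http=:8080 merged.pb.gz`.
	out, err := os.Create("merged.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if err := merged.Write(out); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("merged %d profiles; summed profiling window: %s\n", len(profs), window)
}
```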

So here are my questions:

  1. How is the total time in the profiling result collected? The sampled time (1.06 hrs) is far shorter than the vCPU time (16.636 hrs) reported by Dataflow.

  2. What does the number in the file name "profprocess_bundle-*" mean? I assumed it might be the index of a worker, but it cannot be: the numbers are not contiguous, and the largest one (122) exceeds both the worker count and the number of files (66).


Solution

  • When you enable --cpu_profiling, profiling starts when an SDK worker begins processing a bundle (a batch of input elements that passes through a subgraph of your pipeline DAG, sometimes also referred to as a work item) and stops when that processing finishes. A job can contain many bundles, and the vCPU time reported by Dataflow measures how long workers are provisioned rather than only the time spent inside bundle processing. That's why the total vCPU time is larger than the sampled period.

    As mentioned above, the number in profprocess_bundle-* is the ID of the bundle being profiled, not a worker index. Bundle IDs are not assigned contiguously, which is why the largest number (122) exceeds both the worker count and the number of files (66). A quick way to confirm this is to extract the IDs from the file names, as in the sketch below.
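
This sketch pulls the numeric suffix out of each downloaded file name and prints the sorted IDs, making the gaps visible; it again assumes the files live in ./profiles:

```go
// Sketch: extract the numeric suffix (the bundle ID) from each
// profprocess_bundle-* file name and print the sorted IDs.
package main

import (
	"fmt"
	"log"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

func main() {
	paths, err := filepath.Glob("profiles/profprocess_bundle-*")
	if err != nil || len(paths) == 0 {
		log.Fatalf("no profiles found: %v", err)
	}

	var ids []int
	for _, path := range paths {
		suffix := strings.TrimPrefix(filepath.Base(path), "profprocess_bundle-")
		if id, err := strconv.Atoi(suffix); err == nil {
			ids = append(ids, id)
		}
	}
	if len(ids) == 0 {
		log.Fatal("no numeric bundle IDs found in file names")
	}
	sort.Ints(ids)
	fmt.Printf("%d profiles, IDs from %d to %d: %v\n", len(ids), ids[0], ids[len(ids)-1], ids)
}
```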