google-cloud-platform, google-cloud-dataflow

What is the best way to see how Dataflow is doing fusion optimization


I'm trying to optimize a pipeline that pulls messages from PubSubIO and sends those messages to a third-party API. One interesting observation is that if I put a GroupBy and a "Degroup" transform after the PubSubIO.read, the throughput of the pipeline increases significantly. I added the GroupBy just to prevent fusion optimization, and now I wonder how exactly the transforms are being merged in a given pipeline.

What is the best way to find out what a pipeline looks like after fusion?


Solution

  • You can access your optimized graph and fused stages either by calling the projects.locations.jobs.get API method or via gcloud by running the following command:

    gcloud dataflow jobs describe --full $JOB_ID --format json
    
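    If you prefer to hit the REST API directly, the same JSON comes back from the projects.locations.jobs.get method when you request the full job view. Below is a minimal sketch with curl, assuming your job runs in $REGION under $PROJECT and that your local gcloud credentials can mint a bearer token:

    # Fetch the full job description, including the optimized/fused pipeline graph
    curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://dataflow.googleapis.com/v1b3/projects/$PROJECT/locations/$REGION/jobs/$JOB_ID?view=JOB_VIEW_ALL"
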

    In the response, the fused stages are described in the ComponentTransform array within each ExecutionStageSummary object. Below is an example from the Cloud Pub/Sub to BigQuery Google-provided template. In this case, we can see the graph was fused into 3 stages, largely delineated by the Reshuffle step within the BigQueryIO sink (an abbreviated sketch of one stage entry follows the list):

    1. S03 - All transforms before the Reshuffle in WriteSuccessfulRecords and WriteFailedRecords
    2. S02 - All transforms after the Reshuffle in WriteSuccessfulRecords
    3. S01 - All transforms after the Reshuffle in WriteFailedRecords
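
    For reference, each entry in the executionPipelineStage array has roughly the shape sketched below. This is abbreviated and illustrative; the transform names here are hypothetical placeholders, not literal template output:

    {
      "id": "S03",
      "name": "F2",
      "kind": "PAR_DO_KIND",
      "componentTransform": [
        {
          "name": "s5",
          "userName": "WriteSuccessfulRecords/SomeParDo",
          "originalTransform": "WriteSuccessfulRecords/SomeParDo"
        }
      ]
    }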

    (The full output and complete job description are omitted here for brevity.)

    Since the job description is quite verbose, you may consider piping the output to jq to extract the relevant bits in a one-line command such as the one below:

    gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
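
    To go the other way and find which fused stage a particular transform landed in, you can filter on the userName field. The "Reshuffle" pattern below is just an illustrative example; substitute the user name of your own transform:

    gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | select((.componentTransform // []) | map(.userName) | any(test("Reshuffle"))) | {stage_id: .id, stage_name: .name}'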