I'm trying to optimize a pipeline that pulls messages from PubSubIO and sends them to a 3rd-party API. One interesting observation is that if I put a GroupBy and a "Degroup" transform right after the PubSubIO.read, the throughput of the pipeline increases significantly. I added the GroupBy just to prevent fusion optimization, and now I wonder how exactly the transforms get merged in a given pipeline.
What is the best way to find out what a pipeline looks like after fusion?
You can access your optimized graph and fused stages either by calling projects.locations.jobs.get or via gcloud by running the following command:
gcloud dataflow jobs describe --full $JOB_ID --format json
In the response, the fused stages are described under the ExecutionStageSummary object, within the ComponentTransform array. Below is an example from the Google-provided Cloud Pub/Sub to BigQuery template. In this case, we can see the graph was fused into 3 steps, largely delineated by the Reshuffle steps within the BigQueryIO sink:
- the stage up to the Reshuffle in WriteSuccessfulRecords and WriteFailedRecords
- the stage after the Reshuffle in WriteSuccessfulRecords
- the stage after the Reshuffle in WriteFailedRecords
Since the job description is quite verbose, you may want to pipe the output to jq to extract the relevant bits in a one-line command such as the one below:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
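If you'd rather post-process the response in a script, the same fields can be pulled out with a few lines of Python. This is a sketch against a hypothetical, heavily abbreviated job description (the stage ids, names, and transforms below are made up; only the field names come from the API response used in the jq command above):

```python
import json

# Hypothetical, abbreviated response from
# `gcloud dataflow jobs describe --full $JOB_ID --format json`.
sample = json.loads("""
{
  "pipelineDescription": {
    "executionPipelineStage": [
      {
        "id": "S01",
        "name": "F0",
        "componentTransform": [
          {"name": "ReadPubSubTopic/PubsubUnboundedSource"},
          {"name": "ConvertMessageToTableRow"}
        ]
      }
    ]
  }
}
""")

# Print each fused stage with the user transforms fused into it.
for stage in sample["pipelineDescription"]["executionPipelineStage"]:
    fused = [t["name"] for t in stage.get("componentTransform", [])]
    print(f'{stage["id"]} ({stage["name"]}): {", ".join(fused)}')
```

Each line of output then corresponds to one fused stage, mirroring the structure the jq filter extracts.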