google-cloud-platform, workflow, google-cloud-dataproc

GCP Dataproc parallel steps execution


I am creating a Dataproc cluster on GCP using a workflow template defined in a YAML file. Once the cluster is created, all the steps start executing in parallel, but I want some steps to run only after all the other steps have completed. Is there any way to achieve this?

Sample YAML used for cluster creation:

jobs:
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /ui.sh
  stepId: run-pig-ui
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /hotel.sh
  stepId: run-pig-hotel

placement:
  managedCluster:
    clusterName: cluster-abc
    labels:
      data: cluster
    config:
      configBucket: bucket-1
      initializationActions:
        - executableFile: gs://bucket-1/install_git.sh
          executionTimeout: 600s
      gceClusterConfig:
        zoneUri: asia-south1-a
        tags:
          - test
      masterConfig:
        machineTypeUri: n1-standard-8
        diskConfig:
          bootDiskSizeGb: 50
      workerConfig:
        machineTypeUri: n1-highcpu-32
        numInstances: 2
        diskConfig:
          bootDiskSizeGb: 100
      softwareConfig:
        imageVersion: 1.4-ubuntu18
        properties:
          core:io.compression.codec.lzo.class: com.hadoop.compression.lzo.LzoCodec
          core:io.compression.codecs: org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
      secondaryWorkerConfig:
        numInstances: 2
        isPreemptible: true

The command used to create the cluster:

gcloud dataproc workflow-templates instantiate-from-file --file file_name.yaml

gcloud version: 261.0.0


Solution

  • You can add a prerequisiteStepIds list to your final workflow step so that it starts only after all the listed prerequisite steps have completed. Steps without prerequisites still start in parallel. The expected structure is described in the JSON API representation of OrderedJob.

    jobs:
    - pigJob:
        continueOnFailure: true
        queryList:
          queries:
          - sh /ui.sh
      stepId: run-pig-ui
    - pigJob:
        continueOnFailure: true
        queryList:
          queries:
          - sh /hotel.sh
      stepId: run-pig-hotel
    - pigJob:
        continueOnFailure: true
        queryList:
          queries:
          - sh /final.sh
      stepId: run-final-step
      prerequisiteStepIds:
        - run-pig-ui
        - run-pig-hotel
    ...
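
To see which steps such a template allows to run side by side, the dependency lists can be resolved locally before instantiating it. The following is a minimal sketch in Python (3.9+), assuming the jobs list has already been loaded into plain dictionaries; the function name execution_waves and the hard-coded jobs list are hypothetical, and the grouping only illustrates the ordering implied by prerequisiteStepIds, not how Dataproc schedules jobs internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory copy of the template's "jobs" list,
# reduced to the fields that matter for ordering.
jobs = [
    {"stepId": "run-pig-ui"},
    {"stepId": "run-pig-hotel"},
    {
        "stepId": "run-final-step",
        "prerequisiteStepIds": ["run-pig-ui", "run-pig-hotel"],
    },
]


def execution_waves(jobs):
    """Group step IDs into waves; a wave can start once all earlier waves finish."""
    step_ids = {job["stepId"] for job in jobs}
    graph = {}
    for job in jobs:
        prereqs = job.get("prerequisiteStepIds", [])
        # Catch typos in step IDs before the template is submitted.
        unknown = [p for p in prereqs if p not in step_ids]
        if unknown:
            raise ValueError(f"{job['stepId']} references unknown steps: {unknown}")
        graph[job["stepId"]] = set(prereqs)

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular prerequisites

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(jobs))
# [['run-pig-hotel', 'run-pig-ui'], ['run-final-step']]
```

The first wave contains the two steps with no prerequisites, which matches the parallel start described in the question; run-final-step appears alone in the second wave because it lists both of them.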