I am creating a Dataproc cluster on GCP using a workflow template defined in a YAML file. Once the cluster is created, all the steps start executing in parallel, but I want some steps to execute only after all the other steps have completed. Is there any way to achieve this?
Sample YAML used for cluster creation:
jobs:
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /ui.sh
  stepId: run-pig-ui
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /hotel.sh
  stepId: run-pig-hotel
placement:
  managedCluster:
    clusterName: cluster-abc
    labels:
      data: cluster
    config:
      configBucket: bucket-1
      initializationActions:
      - executableFile: gs://bucket-1/install_git.sh
        executionTimeout: 600s
      gceClusterConfig:
        zoneUri: asia-south1-a
        tags:
        - test
      masterConfig:
        machineTypeUri: n1-standard-8
        diskConfig:
          bootDiskSizeGb: 50
      workerConfig:
        machineTypeUri: n1-highcpu-32
        numInstances: 2
        diskConfig:
          bootDiskSizeGb: 100
      softwareConfig:
        imageVersion: 1.4-ubuntu18
        properties:
          core:io.compression.codec.lzo.class: com.hadoop.compression.lzo.LzoCodec
          core:io.compression.codecs: org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
      secondaryWorkerConfig:
        numInstances: 2
        isPreemptible: true
The command used to create the cluster:
gcloud dataproc workflow-templates instantiate-from-file --file file_name.yaml
gcloud version: 261.0.0
You can use the prerequisiteStepIds list in your final workflow step to ensure it runs only after all of its prerequisite steps have completed. You can see the expected structure in the corresponding JSON API representation for OrderedJob.
jobs:
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /ui.sh
  stepId: run-pig-ui
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /hotel.sh
  stepId: run-pig-hotel
- pigJob:
    continueOnFailure: true
    queryList:
      queries:
      - sh /final.sh
  stepId: run-final-step
  prerequisiteStepIds:
  - run-pig-ui
  - run-pig-hotel
...
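If you build the template with gcloud commands rather than a YAML file, the same dependency can be expressed with the --start-after flag on add-job. A minimal sketch, assuming the template name, region, and query are placeholders:

gcloud dataproc workflow-templates add-job pig \
  --workflow-template=my-template \
  --region=asia-south1 \
  --step-id=run-final-step \
  --start-after=run-pig-ui,run-pig-hotel \
  --execute="sh /final.sh"

Either way, steps with no prerequisiteStepIds start in parallel when the workflow begins, and run-final-step is scheduled only once both of its prerequisites have finished.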