I am trying to build a Kubeflow pipeline in which two components (each with a GPU constraint) run in parallel. It seemed like a non-issue, but every time I tried it, one component got stuck at "pending" until the other component finished.
The two components I am testing are simple while loops with a GPU constraint:
while_op1 = while_loop_op(image_name='tensorflow/tensorflow:1.15.2-py3')
while_op1.name = 'while-1-gpu'
while_op1.set_security_context(V1SecurityContext(privileged=True))
while_op1.apply(gcp.use_gcp_secret('user-gcp-sa'))
while_op1.add_pvolumes({pv_base_path: _volume_op.volume})
while_op1.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
while_op1.set_gpu_limit(1)
while_op1.after(init_op)
Where while_loop_op is:
import kfp.components as comp

def while_loop_op(image_name):
    def while_loop():
        import time
        max_count = 300
        count = 0
        while True:
            if count >= max_count:
                print('Done.')
                break
            time.sleep(10)
            count += 10
            print("{} seconds have passed...".format(count))
    op = comp.func_to_container_op(while_loop, base_image=image_name)
    return op()
The issue might be related to your use of volumes. Have you tried using the better-supported data passing mechanisms instead?
For example, take this pipeline: https://github.com/kubeflow/pipelines/blob/091316b8bf3790e14e2418843ff67a3072cfadc0/components/XGBoost/_samples/sample_pipeline.py
Apply the GPU-related customizations to the trainer:
some_task.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
some_task.set_gpu_limit(1)
Put the trainer and predictor inside a for _ in range(10): loop so that you have 10 parallel copies.
Check whether the trainers run in parallel.
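The steps above could be sketched roughly as follows. This is only a minimal sketch, not the sample pipeline's actual code: apply_gpu_settings and fan_out are hypothetical helper names, and make_task stands in for whatever component factory you use (for example, the trainer op from the linked sample). The only KFP calls assumed are add_node_selector_constraint and set_gpu_limit, both of which already appear in the question.

```python
def apply_gpu_settings(task, accelerator='nvidia-tesla-p100', gpu_count=1):
    """Attach the GPU node selector and GPU limit to a pipeline task.

    `task` is assumed to expose add_node_selector_constraint() and
    set_gpu_limit(), as the tasks in the question do.
    """
    task.add_node_selector_constraint('cloud.google.com/gke-accelerator', accelerator)
    task.set_gpu_limit(gpu_count)
    return task


def fan_out(make_task, copies=10):
    """Create `copies` independent tasks from the factory `make_task`.

    No .after() dependency is added between the copies, so the only thing
    that should serialize them is the cluster's scheduling (e.g. available GPUs).
    """
    return [apply_gpu_settings(make_task()) for _ in range(copies)]
```

Inside a pipeline function you would call something like fan_out(lambda: some_trainer_op(...)), where some_trainer_op is a placeholder for your own component, and then watch in the UI whether the copies start at the same time.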
P.S. It's better to create issues in the official repo: https://github.com/kubeflow/pipelines/issues