google-cloud-platform gcloud google-cloud-run google-cloud-run-jobs

Why does `gcloud run jobs execute` punish you for waiting?


If I run:

gcloud run jobs execute foo --project myproj --region us-central1 --format json

It returns a big, beautiful data structure for the Execution resource that was generated in the cloud. This happens whether or not the execution ends up failing.

However, if I run:

gcloud run jobs execute foo --project myproj --region us-central1 --format json --wait

then, if the job fails, I get nothing: just a few lines of plain-text error reporting written to STDERR. Nothing structured that I can write tooling around.

X Creating execution... Task foo-8vn7q-task0 failed with message: The container exited with an error.
  ✓ Provisioning resources...
  ✓ Starting execution...
  X Running execution... 0 / 1 complete
Executing job failed
ERROR: (gcloud.run.jobs.execute) The execution failed.
View details about this execution by running:
gcloud run jobs executions describe foo-8vn7q

Or visit https://console.cloud.google.com/run/jobs/executions/details/us-central1/foo-8vn7q/tasks?project=xxx

Why? Waiting longer should yield more data, not less. Why is gcloud punishing me for waiting? Why not still return the Execution record, but with the failure state recorded? (The same output produced by gcloud run jobs executions describe after-the-fact.)


Furthermore, I'd be willing to have my tool compensate for this: whenever gcloud returns a non-zero exit code on execution, it could follow up with a describe command, if that command had a --wait flag that blocks until the job execution completes. But it doesn't, so that just leaves polling.

In the meantime, I'm just going to have my tool disallow the --wait flag for executions in order to avoid this situation entirely.


Solution

  • Google's going to tell you that this is WAI (working as intended).

    I think the logic is that the "big, beautiful data structure" (Execution) is, in fact, always returned by the service (--wait or not); gcloud just doesn't surface it when the execution fails, because the result of the command is an error.

    I think what you want is to always receive the Execution (which is returned by the initial method to run the job) and, to do this, you could unpick the gcloud calls.

    If you add --log-http to a gcloud command, you'll see the underlying method calls and gain more insight into how the command is implemented.

    Interestingly, gcloud run jobs uses the v1 API not v2.

    When you run gcloud run jobs execute, the command:

    1. POSTs to namespaces.jobs.run
    2. Uses the Execution returned to get the ObjectMeta selfLink
    3. Uses the selfLink to (repeatedly) call namespaces.executions.get for updated Executions

    The implementation effectively polls the endpoint, checking for updates to the returned Execution's ExecutionStatus.
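
    To make "checking the ExecutionStatus" concrete, here is a minimal Python sketch of how a tool might classify an Execution that has already been parsed from JSON. It assumes the status carries a conditions list containing an entry of type "Completed" whose status is "True", "False" or "Unknown"; that is how v1 Executions appear to report progress, but verify it against your own describe output. The function names are hypothetical.

```python
from typing import Optional


def completed_condition(execution: dict) -> Optional[dict]:
    """Return the 'Completed' condition of an Execution dict, if present."""
    status = execution.get("status") or {}
    for condition in status.get("conditions") or []:
        if condition.get("type") == "Completed":
            return condition
    return None


def execution_outcome(execution: dict) -> str:
    """Classify an Execution as 'succeeded', 'failed' or 'running'.

    Assumption: Completed=True means success, Completed=False means
    failure, and anything else (including no condition yet) means the
    execution is still in progress.
    """
    condition = completed_condition(execution)
    if condition is None:
        return "running"
    if condition.get("status") == "True":
        return "succeeded"
    if condition.get("status") == "False":
        return "failed"
    return "running"
```

    A failed execution would then be reported as "failed" while the full Execution record stays available to the caller, which is the behavior the question asks for.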

    The difference between not waiting and waiting (--wait) is that:

    • not waiting proceeds until the execution starts running; and
    • waiting proceeds until the execution completes (or fails).

    So, until Google amends the behavior of gcloud run jobs execute, you could call the methods directly (e.g. with curl) to get the behavior that you want, including always receiving an Execution.
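
    If you would rather stay with gcloud than call the API directly, the same effect can be approximated by executing without --wait and then polling describe yourself. The following Python sketch uses only the two commands shown in the question; execute_and_wait, gcloud_json and is_finished are hypothetical names, and the terminal-state check assumes the "Completed" condition described above. Polling interval and timeout are arbitrary.

```python
import json
import subprocess
import time
from typing import Callable, List


def gcloud_json(args: List[str]) -> dict:
    """Run a gcloud command with --format json and parse its stdout."""
    result = subprocess.run(
        ["gcloud", *args, "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def is_finished(execution: dict) -> bool:
    """Assumption: a 'Completed' condition of True or False is terminal."""
    for cond in (execution.get("status") or {}).get("conditions") or []:
        if cond.get("type") == "Completed":
            return cond.get("status") in ("True", "False")
    return False


def execute_and_wait(
    job: str,
    project: str,
    region: str,
    run: Callable[[List[str]], dict] = gcloud_json,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
) -> dict:
    """Start a job without --wait, then poll until it finishes.

    Returns the final Execution record whether it succeeded or failed.
    """
    scope = ["--project", project, "--region", region]
    execution = run(["run", "jobs", "execute", job, *scope])
    name = execution["metadata"]["name"]
    deadline = time.monotonic() + timeout_seconds
    while not is_finished(execution):
        if time.monotonic() > deadline:
            raise TimeoutError(f"execution {name} did not finish in time")
        time.sleep(poll_seconds)
        execution = run(
            ["run", "jobs", "executions", "describe", name, *scope]
        )
    return execution
```

    Calling execute_and_wait("foo", "myproj", "us-central1") would block until the execution reaches a terminal state and hand back the structured record either way. The run parameter is injectable only so the loop can be exercised without a real project.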

    Notes:

    1. There is a gcloud beta run jobs set of commands but these also use the v1 API and don't appear to behave differently.
    2. There is a v2 API (unused by gcloud) whose projects.locations.jobs.run method returns an Operation, which is better aligned with other long-running requests.