tensorflow, google-cloud-ml-engine

Cannot resubmit job to ml-engine because "A job with this id already exists"


I am trying to submit a job to gcloud ml-engine. For reference, the job uses this sample provided by Google.

It went through the first time, but with errors unrelated to this question. Having corrected those errors, I am now trying to reissue the command:

gcloud ml-engine jobs submit training $JOB_NAME \
                                    --stream-logs \
                                    --runtime-version 1.0 \
                                    --job-dir $GCS_JOB_DIR \
                                    --module-name trainer.task \
                                    --package-path trainer/ \
                                    --region us-east1 \
                                    -- \
                                    --train-files $TRAIN_GCS_FILE \
                                    --eval-files $EVAL_GCS_FILE \
                                    --train-steps $TRAIN_STEPS

Here $JOB_NAME is census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, and so on for every new job.

The following is the error I receive:

ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.

Is it by design that a job cannot be resubmitted under the same job name, or am I missing something?


Solution

  • Not sure if this will help, but Google's sample code for flowers avoids this error by appending the date and time to the job ID, as shown on line 22, e.g.:

    declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
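
    The same idea could be applied to the census job from the question. This is only a sketch: JOB_BASE is a name made up here for illustration, and it assumes a timestamp with one-second resolution is unique enough (two submissions within the same second would still collide).

```shell
# Hypothetical base name; the timestamp suffix gives each submission a new id.
JOB_BASE="census"
JOB_NAME="${JOB_BASE}_$(date +%Y%m%d_%H%M%S)"

# Show the generated id, e.g. census_<date>_<time>.
echo "Job name for this submission: $JOB_NAME"
```

    With JOB_NAME set this way, the gcloud ml-engine jobs submit training command from the question can be reissued unchanged, and each run gets its own job ID.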