google-cloud-ml google-cloud-ml-engine

CloudML job + verbosity == Error


I'm running the dataeng-machine-learning codelab and am on step 9, "4. Feature Engineering".

The notebook step for running a training job is:

    %%bash
    OUTDIR=gs://${BUCKET}/taxifare/ch4/taxi_trained
    JOBNAME=lab4a_$(date -u +%y%m%d_%H%M%S)
    echo $OUTDIR $REGION $JOBNAME
    gsutil -m rm -rf $OUTDIR
    gcloud ml-engine jobs submit training $JOBNAME \
       --region=$REGION \
       --module-name=trainer.task \
       --package-path=${REPO}/courses/machine_learning/feateng/taxifare/trainer \
       --job-dir=$OUTDIR \
       --staging-bucket=gs://$BUCKET \
       --scale-tier=BASIC \
       --runtime-version=1.0 \
       -- \
       --train_data_paths="gs://$BUCKET/taxifare/ch4/taxi_preproc/train*" \
       --eval_data_paths="gs://${BUCKET}/taxifare/ch4/taxi_preproc/valid*" \
       --output_dir=$OUTDIR \
       --num_epochs=100

That works fine no matter how many times I run it.

However, if I run the same command with --verbosity DEBUG appended:

    %%bash
    OUTDIR=gs://${BUCKET}/taxifare/ch4/taxi_trained
    JOBNAME=lab4a_$(date -u +%y%m%d_%H%M%S)
    echo $OUTDIR $REGION $JOBNAME
    gsutil -m rm -rf $OUTDIR
    gcloud ml-engine jobs submit training $JOBNAME \
       --region=$REGION \
       --module-name=trainer.task \
       --package-path=${REPO}/courses/machine_learning/feateng/taxifare/trainer \
       --job-dir=$OUTDIR \
       --staging-bucket=gs://$BUCKET \
       --scale-tier=BASIC \
       --runtime-version=1.0 \
       -- \
       --train_data_paths="gs://$BUCKET/taxifare/ch4/taxi_preproc/train*" \
       --eval_data_paths="gs://${BUCKET}/taxifare/ch4/taxi_preproc/valid*" \
       --output_dir=$OUTDIR \
       --num_epochs=100 \
       --verbosity DEBUG

the job fails after about 40 seconds with this in the logs: "The replica master 0 exited with a non-zero status of 2. Termination reason: Error."

I found this flag used here: https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#cloud-train-single

So I assumed it was OK to use.

What am I doing wrong?


Solution

  • Note that every argument after the "-- \" line is passed through to your trainer code (the TensorFlow program in trainer.task), so which flags are valid depends on the individual sample.

    In this case, the sample you are running doesn't support the "--verbosity" flag. Judging from the samples repo, the only sample that accepts it appears to be the census estimator sample.
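A minimal sketch of why the job dies, assuming the trainer parses its command line with Python's argparse and calls parse_args() (so it rejects unknown flags). argparse exits with status 2 when it sees an unrecognized argument, which matches the "non-zero status of 2" in the log. The flag set below is a hypothetical stand-in for the taxifare trainer's, and build_parser, exit_status, split_at_separator and the gs://b/... paths are illustrative names, not part of the codelab. The optional --verbosity argument only sketches the kind of option the census sample adds. Check that sample for how it actually applies the level.

```python
import argparse
import contextlib
import io


def split_at_separator(argv):
    """Simplified model of gcloud: tokens after the first '--' go to the trainer."""
    if "--" not in argv:
        return argv, []
    i = argv.index("--")
    return argv[:i], argv[i + 1:]


def build_parser(accept_verbosity):
    """Hypothetical trainer CLI; only flags declared here are accepted."""
    parser = argparse.ArgumentParser(prog="trainer.task")
    parser.add_argument("--train_data_paths", required=True)
    parser.add_argument("--eval_data_paths", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--num_epochs", type=int, default=100)
    if accept_verbosity:
        # Hypothetical addition, similar in spirit to the census sample's flag.
        parser.add_argument(
            "--verbosity",
            choices=["DEBUG", "INFO", "WARN", "ERROR", "FATAL"],
            default="INFO",
        )
    return parser


def exit_status(argv, accept_verbosity):
    """Return the status argparse exits with (0 if parsing succeeds)."""
    parser = build_parser(accept_verbosity)
    # Silence the usage/error text argparse writes to stderr.
    with contextlib.redirect_stderr(io.StringIO()):
        try:
            parser.parse_args(argv)
        except SystemExit as exc:
            return exc.code
    return 0


# Placeholder values for the arguments after "--" in the notebook cell.
ARGS = [
    "--train_data_paths=gs://b/train*",
    "--eval_data_paths=gs://b/valid*",
    "--output_dir=gs://b/out",
    "--num_epochs=100",
]

if __name__ == "__main__":
    print(exit_status(ARGS, accept_verbosity=False))
    print(exit_status(ARGS + ["--verbosity", "DEBUG"], accept_verbosity=False))
    print(exit_status(ARGS + ["--verbosity", "DEBUG"], accept_verbosity=True))
```

Run as a script, this prints 0, then 2, then 0. So for this codelab the fix is to drop --verbosity DEBUG. Alternatively, add such an option to the trainer's own parser if you need it.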