When I train locally, using Google Cloud Storage buckets as data source and destination, with:
gcloud ml-engine local train --module-name trainer.task_v2s --package-path trainer/
I get normal results, and checkpoints are saved properly every 20 steps, since my dataset has 400 examples and I use a batch size of 20: 400 / 20 = 20 steps = 1 epoch. These files get saved in my model dir in the bucket (a rough sketch of the kind of input pipeline I use follows the file listing):
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-20.data-00000-of-00001
model.ckpt-20.index
model.ckpt-20.meta
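For context, this is roughly the kind of input pipeline I mean; the file pattern and feature spec are simplified placeholders rather than my actual code:

import tensorflow as tf

BATCH_SIZE = 20  # 400 examples / batch size 20 = 20 global steps per epoch

def _parse_example(serialized):
    # Placeholder feature spec; my real features differ.
    features = tf.parse_single_example(serialized, {
        'x': tf.FixedLenFeature([10], tf.float32),
        'y': tf.FixedLenFeature([], tf.int64),
    })
    return {'x': features['x']}, features['y']

def input_fn():
    # Read the TFRecords straight from the bucket and batch them.
    files = tf.gfile.Glob('gs://my-bucket/data/*.tfrecord')  # placeholder pattern
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(_parse_example)
    dataset = dataset.shuffle(buffer_size=400).batch(BATCH_SIZE)
    return dataset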
Furthermore, my local GPU is properly engaged:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1018 G /usr/lib/xorg/Xorg 212MiB |
| 0 1889 G compiz 69MiB |
| 0 5484 C ...rtualenvs/my_project/bin/python 2577MiB |
+-----------------------------------------------------------------------------+
When I now submit the same job to Cloud ML Engine instead:
gcloud ml-engine jobs submit training my_job_name \
--module-name trainer.task_v2s --package-path trainer/ \
--staging-bucket gs://my-bucket --region europe-west1 \
--scale-tier BASIC_GPU --runtime-version 1.8 --python-version 3.5
It takes around the same time to save a checkpoint, but checkpoints are now saved in 1-step increments, even though the data sources have not changed. The loss also decreases much more slowly, as if only a single example were being trained on per step.
The GPU is also not getting engaged at all:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I'm using a custom Estimator with no cluster spec configured, as I assume you only need that for distributed computing, and my run_config looks like this:
Using config: {'_master': '', '_num_ps_replicas': 0, '_session_config': None, '_task_id': 0, '_model_dir': 'gs://my_bucket/model_dir', '_save_checkpoints_steps': None, '_tf_random_seed': None, '_task_type': 'master', '_keep_checkpoint_max': 5, '_evaluation_master': '', '_device_fn': None, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_cluster_spec': , '_log_step_count_steps': 100, '_is_chief': True, '_global_id_in_cluster': 0, '_num_worker_replicas': 1, '_service': None, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None}
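For reference, the estimator is constructed roughly like this; my_model_fn stands in for my custom model function, which I haven't shown here:

import tensorflow as tf

# Matches the dump above: checkpoints every 600 s, keep at most 5.
run_config = tf.estimator.RunConfig(
    model_dir='gs://my_bucket/model_dir',
    save_checkpoints_secs=600,
    keep_checkpoint_max=5)

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,  # my custom model function (not shown)
    config=run_config)

estimator.train(input_fn=input_fn)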
From the logs I can also see the TF_CONFIG environment variable:
{'environment': 'cloud', 'cluster': {'master': ['127.0.0.1:2222']}, 'job': {'python_version': '3.5', 'run_on_raw_vm': True, 'package_uris': ['gs://my-bucket/my-project10/27cb2041a4ae5a14c18d6e7f8622d9c20789e3294079ad58ab5211d8e09a2669/MyProject-0.9.tar.gz'], 'runtime_version': '1.8', 'python_module': 'trainer.task_v2s', 'scale_tier': 'BASIC_GPU', 'region': 'europe-west1'}, 'task': {'cloud': 'qc6f9ce45ab3ea3e9-ml', 'type': 'master', 'index': 0}}
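In case it helps, this is a quick way to inspect it from inside the trainer; RunConfig parses TF_CONFIG automatically, this is just for logging:

import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
print('task:', tf_config.get('task'))
print('cluster:', tf_config.get('cluster'))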
My guess is that I need to configure something I haven't, but I have no idea what. I also get some warnings at the beginning, but I don't think they have anything to do with this:
google-cloud-vision 0.29.0 has requirement requests<3.0dev,>=2.18.4, but you'll have requests 2.13.0 which is incompatible.
I just found my error: I needed to put tensorflow-gpu instead of tensorflow in my setup.py. Even better, as rhaertel80 stated, is to omit tensorflow altogether, since the runtime version selected at submission already provides it.
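For anyone running into the same thing, my setup.py now looks roughly like this; the name and version are taken from the job config above, and the dependency list is a sketch:

from setuptools import find_packages, setup

setup(
    name='MyProject',
    version='0.9',
    packages=find_packages(),
    install_requires=[
        # 'tensorflow==1.8.0',      # CPU-only build -- this is what kept the GPU idle
        # 'tensorflow-gpu==1.8.0',  # works, but pinning it here is unnecessary
        # best: list neither; --runtime-version 1.8 already ships a GPU-enabled TensorFlow
    ],
)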