Tags: gpu, google-cloud-ml-engine

Google Cloud ML Engine GPU error


I've created several jobs for training a CNN on Google Cloud ML Engine. Each time the job finished successfully, but with a GPU error. The printed device placement included some GPU activity, yet the job details/utilization page showed no GPU usage.

Here is the command I use to create a job:

gcloud beta ml-engine jobs submit training fei_test34 --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.main --region europe-west1 --staging-bucket gs://tfoutput --scale-tier BASIC_GPU -- --data=gs://crispdata/cars_128 --max_epochs=1 --train_log_dir=gs://tfoutput/joboutput --model=trainer.crisp_model_2x64_2xBN --validation=True -x
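For context, everything after the bare -- in that command is passed through to the trainer package rather than interpreted by gcloud, and because --job-dir is set, ML Engine also forwards a --job-dir argument to the module (it shows up later in the parsed-flags dict in the log). Below is a minimal sketch of how trainer/main.py might consume these flags with argparse; the flag names come from the command above, but the parsing code itself is my assumption, not the actual trainer:

import argparse

def parse_args():
    # Hypothetical sketch: flag names mirror the submit command above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', required=True,
                        help='GCS path to the training data, e.g. gs://crispdata/cars_128')
    parser.add_argument('--max_epochs', type=int, default=1)
    parser.add_argument('--train_log_dir', default='')
    parser.add_argument('--model', default='trainer.crisp_model_2x64_2xBN')
    parser.add_argument('--validation', type=lambda s: s.lower() == 'true', default=False)
    parser.add_argument('--job-dir', dest='job_dir', default='',
                        help='Forwarded by ML Engine because --job-dir is set on the command line')
    # Ignore anything this sketch does not model (such as the trailing -x).
    args, _ = parser.parse_known_args()
    return args

if __name__ == '__main__':
    print(parse_args())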

Here is the device placement log: log device placement. GPU error: GPU error detail

More info:

When I ran my code on Google Cloud ML Engine, the average training speed with one Tesla K80 was 8.2 examples/sec, and 5.7 examples/sec without GPUs, at image size 112x112. With the same code I got 130.4 examples/sec using one GRID K520 on Amazon AWS. I expected the Tesla K80 to be faster. I also got the GPU error I posted yesterday. Additionally, in the Compute Engine Quotas page I can see CPU usage > 0%, but GPU usage stays at 0%. I was wondering whether the GPU is really working.
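One quick way to check whether a GPU is even visible inside the training container is to list the local devices from within the trainer. This is a minimal TF 1.x sketch of my own, not something taken from the job above:

import tensorflow as tf
from tensorflow.python.client import device_lib

def log_visible_devices():
    # Prints every device TensorFlow can see; if no GPU device appears here,
    # the K80 is not visible to the process and all ops fall back to the CPU.
    for device in device_lib.list_local_devices():
        tf.logging.info('Visible device: %s (%s)', device.name, device.device_type)

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    log_visible_devices()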

I am not familiar with cloud computing, so I'm not sure I've provided enough information. Feel free to ask for more details.

I just tried setting the scale tier to complex_model_m_gpu. The training speed is about the same as with one GPU (because my code is written for a single GPU), but there is more information in the log. Here is a copy of the log:

I successfully opened CUDA library libcudnn.so.5 locally

I successfully opened CUDA library libcufft.so.8.0 locally

I successfully opened CUDA library libcuda.so.1 locally

I successfully opened CUDA library libcurand.so.8.0 locally

I Summary name cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.

I Summary name total_loss (raw) is illegal; using total_loss__raw_ instead.

W The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

W The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 0 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:04.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x39ec240

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 1 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:05.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x39f00b0

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 2 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:06.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x3a148b0

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 3 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:07.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

I Peer access not supported between device ordinals 0 and 1

I Peer access not supported between device ordinals 0 and 2

I Peer access not supported between device ordinals 0 and 3

I Peer access not supported between device ordinals 1 and 0

I Peer access not supported between device ordinals 1 and 2

I Peer access not supported between device ordinals 1 and 3

I Peer access not supported between device ordinals 2 and 0

I Peer access not supported between device ordinals 2 and 1

I Peer access not supported between device ordinals 2 and 3

I Peer access not supported between device ordinals 3 and 0

I Peer access not supported between device ordinals 3 and 1

I Peer access not supported between device ordinals 3 and 2

I DMA: 0 1 2 3

I 0: Y N N N

I 1: N Y N N

I 2: N N Y N

I 3: N N N Y

I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)

I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)

I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)

I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)

I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)

I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)

I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)

I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)

I 361

I bucket = crispdata, folder = cars_128/train

I path = gs://crispdata/cars_128/train

I Num examples = 240

I bucket = crispdata, folder = cars_128/val

I path = gs://crispdata/cars_128/val

I Num examples = 60

I {'flop': False, 'learning_rate_decay_factor': 0.005, 'train_log_dir': 'gs://tfoutput/joboutput/20170411_144221', 'valid_score_path': '/home/ubuntu/tensorflow/cifar10/validation_score.csv', 'saturate_epoch': 200, 'test_score_path': '', 'max_tries': 75, 'max_epochs': 10, 'id': '20170411_144221', 'test_data_size': 0, 'memory_usage': 0.3, 'load_size': 128, 'test_batch_size': 10, 'max_out_norm': 1.0, 'email_notify': False, 'skip_training': False, 'log_device_placement': False, 'learning_rate_decay_schedule': '', 'cpu_only': False, 'standardize': False, 'num_epochs_per_decay': 1, 'zoom_out': 0.0, 'val_data_size': 100, 'learning_rate': 0.1, 'grayscale': 0.0, 'train_data_size': 250, 'minimal_learning_rate': 1e-05, 'save_valid_scores': False, 'train_batch_size': 50, 'rotation': 0.0, 'val_epoch_size': 2, 'data': 'gs://crispdata/cars_128', 'val_batch_size': 50, 'num_classes': 2, 'learning_rate_decay': 'linear', 'random_seed': 5, 'num_threads': 1, 'num_gpus': 1, 'test_dir': '', 'shuffle_traindata': False, 'pca_jitter': 0.0, 'moving_average_decay': 1.0, 'sample_size': 128, 'job-dir': 'gs://tfoutput/joboutput', 'learning_algorithm': 'sgd', 'train_epoch_size': 5, 'model': 'trainer.crisp_model_2x64_2xBN', 'validation': False, 'tower_name': 'tower'}

I Filling queue with 100 CIFAR images before starting to train. This will take a few minutes.

I name: "train"

I op: "NoOp"

I input: "^GradientDescent"

I input: "^ExponentialMovingAverage"

I 128 128

I 2017-04-11 14:42:44.766116: epoch 0, loss = 0.71, lr = 0.100000 (5.3 examples/sec; 9.429 sec/batch)

I 2017-04-11 14:43:19.077377: epoch 1, loss = 0.53, lr = 0.099500 (8.1 examples/sec; 6.162 sec/batch)

I 2017-04-11 14:43:51.994015: epoch 2, loss = 0.40, lr = 0.099000 (7.7 examples/sec; 6.479 sec/batch)

I 2017-04-11 14:44:22.731741: epoch 3, loss = 0.39, lr = 0.098500 (8.2 examples/sec; 6.063 sec/batch)

I 2017-04-11 14:44:52.476539: epoch 4, loss = 0.24, lr = 0.098000 (8.4 examples/sec; 5.935 sec/batch)

I 2017-04-11 14:45:23.626918: epoch 5, loss = 0.29, lr = 0.097500 (8.1 examples/sec; 6.190 sec/batch)

I 2017-04-11 14:45:54.489606: epoch 6, loss = 0.56, lr = 0.097000 (8.6 examples/sec; 5.802 sec/batch)

I 2017-04-11 14:46:27.022781: epoch 7, loss = 0.12, lr = 0.096500 (6.4 examples/sec; 7.838 sec/batch)

I 2017-04-11 14:46:57.335240: epoch 8, loss = 0.25, lr = 0.096000 (8.7 examples/sec; 5.730 sec/batch)

I 2017-04-11 14:47:30.425189: epoch 9, loss = 0.11, lr = 0.095500 (7.8 examples/sec; 6.398 sec/batch)

Does this mean the GPUs are in use? If so, any idea why there is such a huge speed difference compared with the GRID K520 running the same code?


Solution

  • So the log messages indicate that GPUs are available. To check whether the GPUs are actually being used, you can turn on device placement logging to see which ops are assigned to the GPUs.

    The Cloud Compute console won't show any utilization metrics related to Cloud ML Engine. If you look at the Cloud Console UI for your jobs you will see memory and CPU graphs but not GPU graphs.
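    For reference, here is a minimal TF 1.x sketch of what turning on device placement logging looks like; the toy matmul graph is only an illustration, not the asker's model:

    import tensorflow as tf

    # Toy graph used only to demonstrate placement logging; substitute the real model.
    with tf.device('/gpu:0'):
        a = tf.random_normal([1024, 1024], name='a')
        b = tf.random_normal([1024, 1024], name='b')
        c = tf.matmul(a, b, name='matmul_on_gpu')

    # log_device_placement makes TensorFlow print which device each op runs on;
    # allow_soft_placement lets it fall back to the CPU if no GPU is present.
    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(c)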