Tags: tensorflow, google-cloud-platform, tpu

Google Cloud Platform TPU pod v3-32 RuntimeError: TPU cores on each host is not same


Running an example TPU workload on a v3-32 pod with the "tpu-vm-tf-2.11.0-pod" software version always fails with the same error when initializing TF's TPUStrategy:

RuntimeError: TPU cores on each host is not same. This should not happen!. devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:4, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:5, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:6, TPU, 0, 0), _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:7, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:CPU:0, CPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:0, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:1, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:2, TPU, 0, 0), 
_DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:3, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:4, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:5, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:6, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:3/device:TPU:7, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:CPU:0, CPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:0, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:1, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:2, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:3, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:4, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:5, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:6, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:1/device:TPU:7, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:CPU:0, CPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:0, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:1, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:2, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:3, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:4, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:5, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:6, TPU, 0, 0), _DeviceAttributes(/job:worker/replica:0/task:2/device:TPU:7, TPU, 0, 0)]

Here is the minimal example script I'm executing:

import tensorflow as tf
print("Tensorflow version " + tf.__version__)

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', cluster_resolver.cluster_spec().as_dict()['worker'])

tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)

@tf.function
def add_fn(x,y):
  z = x + y
  return z

x = tf.constant(1.)
y = tf.constant(1.)
z = strategy.run(add_fn, args=(x,y))
print(z)

And here is the command used, where tpu-test-pod is the name of the TPU:

TPU_NAME=tpu-test-pod python3 tpu-test.py

I tried recreating the TPU, but I'm still getting the same error. Can anyone help me? Thanks!


Solution

  • You are getting this error because you are missing one more important environment variable: on pods you also need to export TPU_LOAD_LIBRARY=0. With that set, your code should run. For more details, please follow this guide step by step: https://cloud.google.com/tpu/docs/tensorflow-pods#set_up_a_tpu_vm_pod_running_tensorflow_and_run_a_calculation
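Concretely, the corrected invocation would look like the sketch below, reusing the pod name and script filename from the question (both are assumptions carried over from there):

```shell
# Export both variables before launching. Setting TPU_LOAD_LIBRARY=0 tells
# the client process not to load the TPU library locally, which is needed
# when driving a pod slice rather than a single TPU VM.
export TPU_NAME=tpu-test-pod
export TPU_LOAD_LIBRARY=0

# Then run the script as before (on the TPU VM):
#   python3 tpu-test.py
```

Exporting both variables (rather than prefixing only TPU_NAME on the command line, as in the question) ensures TPU_LOAD_LIBRARY is visible to the TensorFlow client when it initializes the TPU system.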