I signed up for the TPU Research Cloud (TRC) program for the third time in two years. This time I was barely able to create a single preemptible v3-8 TPU, whereas in previous rounds I could allocate five non-preemptible v3-8 TPUs without issue. Either way (preemptible or non-preemptible), the TPU is listed as READY and HEALTHY. However, when I try to access it from my pretraining script, I run into an error that I have never encountered before:
Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling), or the Tensorflow master address is incorrect.
I know that the TensorFlow master address is correct, and I have checked that the TPU is healthy and ready. I have also double-checked that my code is correctly creating the TensorFlow session and specifying the TPU address.
What could be causing this error message, and how can I troubleshoot and fix it?
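One first check is to confirm from the command line that what GCP reports for the node actually matches the master address the script is using. A sketch using the gcloud CLI (the TPU name `pretrain-1` matches my setup, but the zone here is only an example; substitute your own):

```shell
# List TPU nodes in the zone and confirm the node shows up at all.
# (Zone europe-west4-a is an example; use the zone your TPU lives in.)
gcloud compute tpus list --zone=europe-west4-a

# Inspect the node: state should be READY, health HEALTHY, and the
# endpoint should match the grpc://<ip>:8470 master address in your logs.
gcloud compute tpus describe pretrain-1 --zone=europe-west4-a \
  --format="yaml(state, health, networkEndpoints)"
```

If the reported endpoint differs from the address your script resolves, the "master address is incorrect" half of the error message applies.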
I also tried this code from https://www.tensorflow.org/guide/tpu. Note that I'm not using Colab; I'm running directly on Google Cloud Platform.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='pretrain-1')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
And I'm stuck at:
INFO:tensorflow:Initializing the TPU system: pretrain-1
However, I expected output like this:
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
2022-12-20 13:08:56.187870: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.99.59.162:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.99.59.162:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Finished initializing TPU system.
All devices: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]
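Since the error text itself says the worker "may not be ready (still scheduling)", one generic thing I tried while troubleshooting was retrying the initialization with backoff instead of giving up on the first failure. This is only a sketch: `connect` below is a stand-in for whatever call fails (e.g. `tf.tpu.experimental.initialize_tpu_system(resolver)`), not a real TensorFlow API:

```python
import time

def retry_with_backoff(connect, attempts=5, base_delay=5.0):
    """Call connect() until it succeeds, sleeping longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # narrow this to the actual connection error
            if attempt == attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

If the hang persists across many attempts over several minutes, the problem is almost certainly not transient scheduling, which points back at networking or project configuration.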
Edit: I successfully accessed the TPU with the same configuration from a new TPU Research Cloud (TRC) account. However, the problem persists on the previous TRC account, so I suspect it is a problem with that project's Google Cloud Platform (GCP) configuration.
I solved the problem by deleting all TPUs and VM instances and then disabling and re-enabling the relevant APIs.
The issue might have been related to a VPN connection to a GPU cluster that was active while the services were being enabled.
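For reference, the steps above can be done from the gcloud CLI; a sketch (the TPU/VM names, the zone, and the exact API list are assumptions specific to my project; adjust them to yours):

```shell
# Delete the TPU node(s) and VM instance(s).
gcloud compute tpus delete pretrain-1 --zone=europe-west4-a
gcloud compute instances delete pretrain-vm --zone=europe-west4-a

# Disable and then re-enable the relevant APIs for the project.
gcloud services disable tpu.googleapis.com
gcloud services enable tpu.googleapis.com compute.googleapis.com
```

After re-enabling the APIs, recreate the TPU and VM and retry the connection.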