Tags: tensorflow, kubernetes, google-cloud-platform, tpu, google-cloud-tpu

TPU returning "failed call to cuInit: UNKNOWN ERROR (303)" on Google Cloud with Kubernetes Cluster


I am trying to use a TPU with Google Cloud's Kubernetes Engine. My code prints several errors when I try to initialize the TPU, and all other operations run only on the CPU. To run this program, I transfer a Python file from my Dockerhub workspace to Kubernetes, then execute it on a single preemptible v2 TPU. The TPU uses TensorFlow 2.3, which, to the best of my knowledge, is the latest version supported for Cloud TPUs (I get an error saying the version is not yet supported when I try TensorFlow 2.4 or 2.5).

When I run my code, Google Cloud sees the TPU but appears to fail to connect to it and instead uses the CPU. It prints these messages:

tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)

tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (resnet-tpu-fxgz7): /proc/driver/nvidia/version does not exist

tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz

tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561fb2112c20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001

TPU name grpc://10.8.16.2:8470

The errors seem to indicate that TensorFlow needs NVIDIA packages installed, but my understanding from the Google Cloud TPU documentation is that I shouldn't need tensorflow-gpu for a TPU. I tried using tensorflow-gpu anyway and received the same errors, so I am not sure how to fix this problem. I've deleted and recreated my cluster and TPU numerous times, but I can't seem to make any progress. I'm relatively new to Google Cloud, so I may be missing something obvious, but any help would be greatly appreciated.
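
A quick way to see which devices TensorFlow itself reports inside the container is a check like the following (a minimal sketch using standard TF 2.x APIs, nothing specific to my setup):

import tensorflow as tf

# Print the installed TF build and the devices visible to this process.
# A CPU-only container should list only CPU devices here; TPU devices
# appear only after connecting to the remote TPU worker.
print("TensorFlow version:", tf.__version__)
print("Local devices:", tf.config.list_logical_devices())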

This is the Python script I am trying to run:

import sys

import tensorflow as tf


# Parse the TPU endpoint passed on the command line as --tpu=<endpoint>
tpu_name = sys.argv[1].replace('--tpu=', '')
print("TPU name", tpu_name)

# Detect the TPU and resolve its gRPC worker address
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)
tpu_name = 'grpc://' + str(tpu.cluster_spec().as_dict()['worker'][0])
print("TPU name", tpu_name)

# Connect to the TPU cluster and initialize the TPU system
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
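
For context, once initialization succeeds, the next step I plan to run is building a distribution strategy on the resolver (a sketch continuing from the tpu variable above, using the standard TF 2.3 API; this is not part of the failing script):

# Create a TPUStrategy so later model code is replicated across the
# eight cores of the v2-8; reuses the resolver from the script above.
strategy = tf.distribute.TPUStrategy(tpu)
print("Number of replicas:", strategy.num_replicas_in_sync)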

Here is the YAML configuration file for my Kubernetes job (with placeholders for my real workspace name and image):

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    metadata:
      name: test 
      annotations:
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regcred
      containers:
        - name: test
          image: my_workspace/image 
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]

          resources:
            limits:
              cloud-tpus.google.com/preemptible-v2: 8
  backoffLimit: 0 
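
As I understand it, the cloud-tpus.google.com resource request causes GKE to provision the TPU and inject its gRPC endpoint into the container through the KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS environment variable, which is why the command passes it as --tpu. An equivalent sketch that reads the variable directly instead of parsing argv (an assumption on my part, not what I currently run):

import os

import tensorflow as tf

# Read the TPU endpoint GKE injects into the container environment,
# e.g. 'grpc://10.8.16.2:8470'; defaults to '' if the variable is unset.
tpu_endpoint = os.environ.get('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS', '')
print("TPU endpoint from environment:", tpu_endpoint)

tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_endpoint)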



Solution

  • There are actually no errors in the workload or logs you've provided. A few comments which I think might help:

    • pip install tensorflow, as you have noted, installs tensorflow-gpu. By default it attempts GPU-specific initialization, fails (failed call to cuInit: UNKNOWN ERROR (303)), and falls back to local CPU execution. That failure would matter if you were trying to develop on a GPU VM, but for a typical CPU workload it is harmless. Essentially tensorflow == tensorflow-gpu, and without a GPU available it is equivalent to tensorflow-cpu plus the extra warning messages. Installing tensorflow-cpu instead would make these warnings go away.
    • In this workload, the TPU server has its own installation of TensorFlow running as well. It actually doesn't matter whether your local VM (e.g. your GKE container) has tensorflow-gpu or tensorflow-cpu, as long as it is the same TF version as the TPU server. Your workload here is successfully connecting to the TPU server, as shown by the following lines (a way to verify the connection yourself is sketched after these logs):
    tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
    
    tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
    
    tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
    
    tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
    
    tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
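
    • If you want to convince yourself that the TPU is usable despite the CUDA warnings, one quick check (a sketch reusing the same TF 2.3 APIs as your script and the endpoint from your logs) is to list the remote TPU devices after initialization:

    import tensorflow as tf

    tpu = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://10.8.16.2:8470')
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)

    # With a successful connection this prints the eight TPU cores of a
    # v2-8; the cuInit/libcuda warnings at import time do not affect it.
    print(tf.config.list_logical_devices('TPU'))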