linux gpu containerd nvidia-docker gvisor

cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown


Without gVisor

I am trying to use an NVIDIA GPU with gVisor in containerd, driving it with crictl. The GPU works fine with runc, using the configuration below in /etc/nvidia-container-runtime/config.toml:

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "legacy"
runtimes = ["runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

I used nvidia-ctk to configure containerd, so /etc/containerd/config.toml now has a new nvidia runtime. I used the commands below to start the GPU container:

SANDBOX_ID=$(sudo crictl runp --runtime nvidia pod.json)
CONTAINER_ID=$(sudo crictl create ${SANDBOX_ID} container.json pod.json)
sudo crictl start ${CONTAINER_ID}

So without gVisor, the GPU works fine inside the container.

With gVisor

As described in the gVisor documentation (https://gvisor.dev/docs/user_guide/gpu/), I created the script below, made it executable, and placed it on PATH at /usr/local/bin/runscgpu.

#!/bin/bash
exec /usr/local/bin/runsc --nvproxy "$@"
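
A quick sanity check on the wrapper can rule out the simplest failure modes (file not executable, flag missing). This is only a sketch: check_wrapper is a made-up helper name, not something shipped with gVisor or the NVIDIA toolkit, and passing the check does not mean the sandbox will start.

```shell
#!/bin/bash
# check_wrapper is a hypothetical helper, not part of gVisor or the NVIDIA toolkit.
# It verifies that the wrapper exists, is executable, and passes --nvproxy.
check_wrapper() {
  local path="$1"
  if [ ! -x "$path" ]; then
    echo "not executable: $path"
    return 1
  fi
  # `--` stops grep from treating the pattern as an option.
  if ! grep -q -- '--nvproxy' "$path"; then
    echo "missing --nvproxy: $path"
    return 1
  fi
  echo "ok: $path"
}
```

Usage would be check_wrapper /usr/local/bin/runscgpu.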

Then I added this new runtime to /etc/nvidia-container-runtime/config.toml, as below:

....

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "legacy"
runtimes = ["runscgpu", "runc"]

....
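
To see which low-level runtime the runtimes list would resolve to, something like the sketch below can help. It assumes that nvidia-container-runtime tries the listed names in order and uses the first one found on PATH (as far as I understand its behaviour), and that the list sits on a single line as in the config above. first_runtime is a hypothetical helper name.

```shell
#!/bin/bash
# first_runtime is a hypothetical helper: it prints the path of the first entry
# in `runtimes = [...]` that resolves on PATH.
# Assumption: nvidia-container-runtime tries the listed runtimes in order.
first_runtime() {
  local config="$1" name
  # Take the first `runtimes = [...]` line, strip brackets and quotes,
  # and emit one runtime name per line.
  grep -m1 -E '^[[:space:]]*runtimes[[:space:]]*=' "$config" \
    | sed -E 's/^[^[]*\[//; s/\].*$//' \
    | tr ',' '\n' \
    | tr -d '" ' \
    | while read -r name; do
        if [ -n "$name" ] && command -v "$name" >/dev/null 2>&1; then
          command -v "$name"
          break
        fi
      done
}
```

Running first_runtime /etc/nvidia-container-runtime/config.toml as the same user containerd runs as (its PATH may differ from an interactive shell) should print /usr/local/bin/runscgpu if the wrapper is being picked up.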

When I try to create the pod using SANDBOX_ID=$(sudo crictl runp --runtime nvidia pod.json), I get the error below:

DEBU[0000] RunPodSandboxRequest: &RunPodSandboxRequest{Config:&PodSandboxConfig{Metadata:&PodSandboxMetadata{Name:nvidia-sandbox,Uid:4dis4d93djaidwnduwk28bcsb,Namespace:default,Attempt:1,},Hostname:,LogDirectory:/tmp,DnsConfig:nil,PortMappings:[&PortMapping{Protocol:TCP,ContainerPort:80,HostPort:8081,HostIp:,}],Labels:map[string]string{},Annotations:map[string]string{},Linux:&LinuxPodSandboxConfig{CgroupParent:,SecurityContext:nil,Sysctls:map[string]string{},},},RuntimeHandler:nvidia,}
DEBU[0001] RunPodSandboxResponse: nil
FATA[0001] run pod sandbox failed: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown

Logs

Oct 06 15:40:57 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:40:57.473234077Z" level=error msg="RunPodSandbox for name:\"nvidia-sandbox\"  uid:\"4dis4d93djaidwnduwk28bcsb\"  namespace:\"default\"  attempt:1 failed, error" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown"
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.147098477Z" level=info msg="RunPodSandbox for name:\"nvidia-sandbox\"  uid:\"4dis4d93djaidwnduwk28bcsb\"  namespace:\"default\"  attempt:1"
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.212728975Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.212788159Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.212805465Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.212916629Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.681764912Z" level=info msg="shim disconnected" id=2b400251e14e343494e8f6bf73c3de54c56cf777dfa5893098ddcd4168ec3856 namespace=k8s.io
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.681827010Z" level=warning msg="cleaning up after shim disconnected" id=2b400251e14e343494e8f6bf73c3de54c56cf777dfa5893098ddcd4168ec3856 namespace=k8s.io
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.681837475Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.773340581Z" level=warning msg="cleanup warnings time=\"2024-10-06T15:52:05Z\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/k8s.io/2b400251e14e343494e8f6bf73c3de54c56cf777dfa5893098ddcd4168ec3856/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Oct 06 15:52:05 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:05.773732749Z" level=error msg="copy shim log" error="read /proc/self/fd/13: file already closed" namespace=k8s.io
Oct 06 15:52:06 ip-172-21-148-200.eu-central-1.compute.internal containerd[1880]: time="2024-10-06T15:52:06.153555019Z" level=error msg="RunPodSandbox for name:\"nvidia-sandbox\"  uid:\"4dis4d93djaidwnduwk28bcsb\"  namespace:\"default\"  attempt:1 failed, error" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown"
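
The containerd log only shows the shim disconnecting, not why the sandbox process died. runsc has --debug and --debug-log flags that make it write its own logs, so one way to dig further is a debug variant of the wrapper. A sketch, with assumptions: write_debug_wrapper is a hypothetical helper, and the log directory is whatever you pass in (the gVisor docs use /tmp/runsc/ as an example).

```shell
#!/bin/bash
# write_debug_wrapper is a hypothetical helper: it writes a runsc wrapper that
# keeps --nvproxy and additionally turns on runsc debug logging.
#   $1: where to write the wrapper (e.g. /usr/local/bin/runscgpu)
#   $2: directory for runsc debug logs (assumption: any writable directory)
write_debug_wrapper() {
  local target="$1" logdir="$2"
  mkdir -p "$logdir"
  # Unquoted heredoc: ${logdir} is expanded now, \$@ stays literal in the file.
  cat > "$target" <<EOF
#!/bin/bash
exec /usr/local/bin/runsc --nvproxy --debug --debug-log=${logdir}/ "\$@"
EOF
  chmod +x "$target"
}
```

For example, sudo bash -c with write_debug_wrapper /usr/local/bin/runscgpu /tmp/runsc would overwrite the existing wrapper; after retrying crictl runp, the files under /tmp/runsc/ should show what the sandbox process logged before exiting.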

Solution

  • See https://github.com/google/gvisor/issues/10997 for the full answer. I got it working.