
Google Datalab: errors installing the Nvidia driver and starting the Docker container


I'm following https://cloud.google.com/datalab/docs/quickstart (datalab beta create-gpu [datalab-instance-name]). The instance gets created, but the docker container fails to start:

$ docker ps -a:

CONTAINER ID  IMAGE                                      COMMAND                  CREATED         STATUS                     PORTS  NAMES
e44d71c07f6e  gcr.io/cos-cloud/cos-gpu-installer:latest  "/bin/sh -c /entry..."  13 minutes ago  Exited (2) 12 minutes ago         awesome_brattain
56e54c3d3f6d  gcr.io/cos-cloud/cos-gpu-installer:latest  "/bin/sh -c /entry..."  14 minutes ago  Exited (2) 13 minutes ago         naughty_montalcini

Both containers show STATUS=Exited (2).
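For anyone retracing this, the listing can be narrowed to just the exited containers and a few columns. This is only a sketch: `list_exited` is a hypothetical helper name, not part of Docker or Datalab, and it relies on the standard `--filter` and `--format` options of `docker ps`.

```shell
# Hypothetical helper: list only exited containers, one per line,
# showing ID, image, status and name instead of the full-width table.
list_exited() {
  docker ps -a \
    --filter "status=exited" \
    --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Names}}'
}
```

Running `list_exited` on the instance and then `docker logs <container id>` for each row gives the per-container output without the wrapped table.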

The first bad thing I can see:

$ sudo journalctl --since yesterday -fu docker.service has a strange error:

Apr 22 20:53:30 seth2 dockerd[668]: time="2018-04-22T20:53:30.717669594Z" level=error msg="containerd: start container" error="oci runtime error: container_linux.go:247: starting container process caused \"chdir to cwd (\\\"/content/datalab/notebooks\\\") set in config.json failed: no such file or directory\"\n" id=4795b951f1dbae3a23dae67c2d5aaa7a8bc61e1f4fd6ec58814d241da75b245f

And sure enough, there is no /content directory on the instance, even though gcloud lists the disk as READY.

The second bad thing I can see:

$ docker logs e44d71c07f6e looks fine until the end:

[INFO 2018-04-22 20:56:33 UTC] Running Nvidia installer
/usr/local/nvidia /
NVIDIA-Linux-x86_64-384.81.run: 1: NVIDIA-Linux-x86_64-384.81.run: Syntax error: redirection unexpected s

I'm pretty much ready to call this beta functionality a dumpster fire, at least for someone as new to GCP as I am, and try another provider.

Anyone have any ideas I might try though? Thank you so much in advance.


Solution

  • Sorry you hit this.

    This is a new bug for which we have a fix, but that fix has not yet been released (our release process takes at least a week).

    The cause was a recent change to the Container-Optimized OS tooling that broke support for older Nvidia drivers.

    The fix is to update the driver version used by Datalab instances.

    Until the fix makes it out into a release, you can work around the issue by downloading the source code for the tool and running that version instead of the released version.
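A rough sketch of that workaround follows. It makes several assumptions that are not stated in the answer: that the tool's source is the googledatalab/datalab repository on GitHub, that the CLI entry point is tools/cli/datalab.py, and that the default branch already contains the driver fix. `run_datalab_from_source` is a hypothetical helper name.

```shell
# Hypothetical helper: fetch the Datalab CLI source and run it directly
# instead of the released `datalab` command.
# Assumptions: repository URL and entry-point path as described above.
run_datalab_from_source() {
  instance_name="$1"
  git clone https://github.com/googledatalab/datalab.git &&
    python datalab/tools/cli/datalab.py beta create-gpu "$instance_name"
}
```

Usage would be `run_datalab_from_source my-datalab-instance`, where the instance name is whatever you would have passed to `datalab beta create-gpu`.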