(this is cross-posted to SO, the jupyterhub issue tracker, and the jupyterhub/systemdspawner issue tracker)
I have a private JupyterHub setup using the SystemdSpawner,
where I am trying to run TensorFlow with GPU support.
I followed the TensorFlow instructions and alternatively tried an already provisioned AWS AMI (Deep Learning Base AMI (Ubuntu 18.04) Version 21.0) with the NVIDIA drivers, both on AWS EC2 g4 instances.
On both setups I am able to use TensorFlow with GPU support in an (I)Python 3.6 shell:
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
2020-02-12 10:57:13.670937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 10:57:13.698230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.699066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-02-12 10:57:13.699286: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-12 10:57:13.700918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 10:57:13.702512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 10:57:13.702814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 10:57:13.704561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 10:57:13.705586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 10:57:13.709171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 10:57:13.709278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(some warnings about the NUMA node, but the GPU is found)
Running nvidia-smi and deviceQuery also shows the GPU:
$ nvidia-smi
Wed Feb 12 10:39:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 33C P8 9W / 70W | 0MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla T4"
CUDA Driver Version / Runtime Version 10.1 / 10.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 15080 MBytes (15812263936 bytes)
(40) Multiprocessors, ( 64) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1590 MHz (1.59 GHz)
Memory Clock rate: 5001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 30
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = Tesla T4
Result = PASS
Now I start JupyterHub, log in, and open a terminal, where I get:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
and
$ /usr/local/cuda/extras/demo_suite/deviceQuery
cuda/extras/demo_suite/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
I suspect some kind of "sandbox", missing env vars, etc.: the GPU drivers are not found inside the single-user environment, and consequently TensorFlow GPU support does not work there.
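For example, comparing a few environment variables between a hub-spawned terminal and a regular shell on the host (a quick sketch; the variable names are just the usual suspects):

```python
import os

# Run this both in a notebook on the hub and in a plain Python shell on the
# host. SystemdSpawner starts single-user servers with a fresh environment,
# so variables such as LD_LIBRARY_PATH (used to locate libcuda.so.1) may be
# missing inside the hub.
for var in ('PATH', 'LD_LIBRARY_PATH', 'CUDA_HOME'):
    print(var, '=', os.environ.get(var))
```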
Any ideas on this? It is probably either a small configuration tweak, or not solvable at all due to the architecture ;)
Set c.SystemdSpawner.isolate_devices = False in your jupyterhub_config.py.
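For example, a minimal jupyterhub_config.py sketch (the spawner_class line assumes you are already configuring SystemdSpawner there):

```python
# jupyterhub_config.py
c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'

# Do not give each user a private /dev, so single-user servers can see
# the real device nodes, including /dev/nvidia*.
c.SystemdSpawner.isolate_devices = False
```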
Here is an excerpt from the documentation:
Setting this to true provides a separate, private /dev for each user. This prevents the user from directly accessing hardware devices, which could be a potential source of security issues. /dev/null, /dev/zero, /dev/random and the ttyp pseudo-devices will be mounted already, so most users should see no change when this is enabled.
c.SystemdSpawner.isolate_devices = True
This requires systemd version > 227. If you enable this in earlier versions, spawning will fail.
NVIDIA makes use of devices (i.e., files in /dev). Please refer to their documentation for more information. There should be files in there named /dev/nvidia*. Isolating devices with SystemdSpawner will prevent access to those NVIDIA devices.
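A quick way to check whether those device nodes are visible from inside a spawned server (a small sketch to run in a notebook):

```python
import glob

# With isolate_devices = False this should list /dev/nvidia0, /dev/nvidiactl,
# /dev/nvidia-uvm, etc.; with a private /dev it will print an empty list.
print(glob.glob('/dev/nvidia*'))
```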
Is there a way to still isolate the devices and enable GPU support?
I am not sure... but I can provide pointers to documentation. Setting c.SystemdSpawner.isolate_devices = True sets PrivateDevices=yes in the eventual systemd-run call (source). Refer to the systemd documentation for more information on the PrivateDevices option.
You might be able to keep isolate_devices = True and then explicitly mount the NVIDIA devices. Though I don't know how to do this...
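One untested idea, assuming your systemdspawner version supports unit_extra_properties (including list values for repeated properties) and your systemd has BindPaths= (v233+): bind the NVIDIA device nodes back into the private /dev, and also allow them in the device cgroup, since PrivateDevices=yes implies DevicePolicy=closed.

```python
# Untested sketch for jupyterhub_config.py -- adjust the device list to
# whatever `ls /dev/nvidia*` shows on your host.
c.SystemdSpawner.isolate_devices = True
c.SystemdSpawner.unit_extra_properties = {
    # Bind-mount the real device nodes into the unit's private /dev.
    'BindPaths': '/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm',
    # PrivateDevices=yes implies DevicePolicy=closed, so each device
    # must also be explicitly allowed.
    'DeviceAllow': ['/dev/nvidia0 rw', '/dev/nvidiactl rw', '/dev/nvidia-uvm rw'],
}
```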