I'm trying to solve a problem that occurs on our cluster when using TensorFlow v1.0.1 with GPUs under TORQUE v6.1.0, with MOAB as the job scheduler.
The error occurs when the executed Python script tries to start a new session:
[...]
with tf.Session() as sess:
[...]
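A stripped-down script that should hit the same code path (nothing else from train.py, just the session creation) would presumably be enough to trigger it:
import tensorflow as tf

# Nothing but the session creation, which is where the error below occurs.
with tf.Session() as sess:
    print(sess.run(tf.constant("session created")))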
The error message:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Load Data...
input: (12956, 128, 128, 1)
output: (12956, 64, 64, 16)
Initiliaze training
Traceback (most recent call last):
File "[...]/train.py", line 154, in <module>
tf.app.run()
File "[...]/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "[...]/train.py", line 150, in main
training()
File "[...]/train.py", line 72, in training
with tf.Session() as sess:
File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1176, in __init__
super(Session, self).__init__(target, graph, config=config)
File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 552, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "[...]/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "[...]/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
To check whether the problem is reproducible outside the scheduler, I executed the script directly on an offline GPU node (i.e., without TORQUE involved), and it threw no error. Therefore I assume the problem has something to do with TORQUE, but I haven't found a solution yet.
The parameters for TORQUE:
#PBS -l nodes=1:ppn=2:gpus=4:exclusive_process
#PBS -l mem=25gb
I tried it once without exclusive_process, but then the job was not executed at all; I think this flag is required by our scheduler whenever GPUs are requested.
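If it matters for exclusive_process mode: one thing I could try is pinning each Python process to a single GPU before TensorFlow initializes CUDA. A sketch (the index "0" is just an example; in practice it would have to come from whatever GPU TORQUE assigns to the job):
import os

# Must be set before the first "import tensorflow", otherwise TF has
# already enumerated all devices; "0" is only an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

with tf.Session() as sess:
    pass  # rest of the training would go here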
I think I found a way to get the job running: changing the compute mode from 'exclusive_process' to 'shared'.
Now the job starts and seems to compute something, but judging from the nvidia-smi output below I am not sure whether all four GPUs are really being used. Why does the same process appear on all four GPUs? (See also the sketch after the output.)
Fri May 26 13:41:33 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 45C P0 58W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:05:00.0 Off | 0 |
| N/A 37C P0 70W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:84:00.0 Off | 0 |
| N/A 32C P0 59W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:85:00.0 Off | 0 |
| N/A 58C P0 143W / 149W | 11000MiB / 11439MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11757 C python 10867MiB |
| 1 11757 C python 10869MiB |
| 2 11757 C python 10867MiB |
| 3 11757 C python 10996MiB |
+-----------------------------------------------------------------------------+
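As far as I understand, TensorFlow 1.x by default creates a context on every GPU it can see and reserves almost all of their memory, even if the actual computation runs on only one of them; that would match the identical PID and the ~11 GB allocation on all four devices while only GPU 3 shows utilization. A sketch of how the up-front allocation could be limited (the config here is only an illustration, not what train.py currently does):
import tensorflow as tf

# Grow GPU memory on demand instead of pre-allocating it on every
# visible device at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # training loop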