machine-learning, tensorflow, gpu, pbs, torque

Tensorflow: Problems with TORQUE and GPUs when starting new session: CUDA_ERROR_INVALID_DEVICE


I'm trying to solve a problem that occurs on our cluster when using TensorFlow v1.0.1 with GPUs under TORQUE v6.1.0, with MOAB as the job scheduler.

The error occurs when the executed Python script tries to start a new Session:

[...]
with tf.Session() as sess:
[...]
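
For context, the Session is created without any explicit configuration. Below is a minimal sketch (my own, not the actual train.py) of how the same call could be made with an explicit ConfigProto, e.g. to log device placement and to avoid grabbing all GPU memory at once:

import tensorflow as tf

# Sketch only: verbose device placement and on-demand memory growth
config = tf.ConfigProto(
    allow_soft_placement=True,   # fall back to another device if placement fails
    log_device_placement=True)   # print which device each op is assigned to
config.gpu_options.allow_growth = True  # allocate GPU memory as needed

with tf.Session(config=config) as sess:
    pass  # graph construction and training would go here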

The error message:

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Load Data...
input: (12956, 128, 128, 1)
output: (12956, 64, 64, 16)
Initiliaze training
Traceback (most recent call last):
  File "[...]/train.py", line 154, in <module>
tf.app.run()
  File "[...]/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "[...]/train.py", line 150, in main
training()
  File "[...]/train.py", line 72, in training
with tf.Session() as sess:
  File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1176, in __init__
super(Session, self).__init__(target, graph, config=config)
  File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 552, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "[...]/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
  File "[...]/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

To narrow down the problem, I executed the script directly on a GPU node that was taken offline (i.e. without TORQUE involved), and it ran without errors. I therefore assume the problem has something to do with TORQUE, but I haven't found a solution yet.
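
One diagnostic I would add inside the job (a minimal sketch, assuming TORQUE/MOAB exports CUDA_VISIBLE_DEVICES for the allocated GPUs) is to print which devices the process can actually see:

import os
from tensorflow.python.client import device_lib

# Sketch only: show which GPUs the scheduler handed to the job and which
# devices TensorFlow is able to initialize.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)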

The parameters for TORQUE:

#PBS -l nodes=1:ppn=2:gpus=4:exclusive_process
#PBS -l mem=25gb

I tried it once without exclusive_process, but the job was not executed. I think this flag is required by our scheduler whenever GPUs are requested.


Solution

  • I think I found a way to get the job running by changing the compute mode from 'exclusive_process' to 'shared'.

    Now the job starts and seems to compute something. However, judging by the output of nvidia-smi below, I am not sure whether all four GPUs are really being used: the same process has allocated memory on all of them, but only one shows any utilization. Why is the same process listed on every GPU? (A sketch of how I would restrict TensorFlow to a single GPU follows the nvidia-smi output.)

        Fri May 26 13:41:33 2017       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
    | N/A   45C    P0    58W / 149W |  10871MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
    | N/A   37C    P0    70W / 149W |  10873MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
    | N/A   32C    P0    59W / 149W |  10871MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
    | N/A   58C    P0   143W / 149W |  11000MiB / 11439MiB |     95%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     11757    C   python                                       10867MiB |
    |    1     11757    C   python                                       10869MiB |
    |    2     11757    C   python                                       10867MiB |
    |    3     11757    C   python                                       10996MiB |
    +-----------------------------------------------------------------------------+
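
    As far as I understand, TensorFlow by default creates a context and reserves memory on every GPU it can see, which would explain why the same PID shows up on all four GPUs even though only one is busy. A minimal sketch (an assumption of mine, not the original script) of restricting the process to a single GPU before the Session is created:

        import os
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process

        import tensorflow as tf

        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True    # reserve memory on demand instead of all at once

        with tf.Session(config=config) as sess:
            pass  # training would go here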