Tags: cuda, runtime-error, ubuntu-12.04, tesla

Disabled ECC support for Tesla C2070 and Ubuntu 12.04


I have a headless workstation running Ubuntu 12.04 Server and recently installed a new Tesla C2070 card. When I run the examples from the CUDA SDK, I get the following error:

NVIDIA_GPU_Computing_SDK/C/bin/linux/release% ./reduction 
[reduction] starting...

Using Device 0: Tesla C2070

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

reduction.cpp(473) : cudaSafeCallNoSync() Runtime API error 39 : uncorrectable ECC error encountered.

In fact, the same error occurs with every other example except "deviceQuery".

I'm using kernel 3.2.0, NVIDIA driver 295.41, and CUDA 4.2.9.

After a lot of searching, I found a suggestion to disable ECC support with:

   nvidia-smi -g 0 --ecc-config=0

which worked. But how reliable is GPU computing with ECC support disabled?
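Before turning ECC off, it may help to look at the mode the card currently reports. The following is only a sketch: the variables NVSMI and GPU_ID and the function names are illustrative and not part of the original post, and it assumes an nvidia-smi that accepts the -i, -d ECC and -e options (older releases, like the one used above, selected the GPU with -g instead).

```shell
#!/bin/sh
# Sketch: inspect and change the ECC mode of one GPU with nvidia-smi.
# NVSMI and GPU_ID are illustrative variables, not from the original post.
NVSMI="${NVSMI:-nvidia-smi}"
GPU_ID="${GPU_ID:-0}"

have_nvsmi() {
    # Succeed only if the nvidia-smi binary can be found on PATH.
    if ! command -v "$NVSMI" >/dev/null 2>&1; then
        echo "$NVSMI not found" >&2
        return 1
    fi
}

show_ecc() {
    # Print the ECC section of the query report for the selected GPU.
    have_nvsmi || return 1
    "$NVSMI" -q -i "$GPU_ID" -d ECC
}

disable_ecc() {
    # Request ECC off for the selected GPU (short form of --ecc-config=0).
    # This normally needs root, and the new mode is applied at the next reboot.
    have_nvsmi || return 1
    "$NVSMI" -i "$GPU_ID" -e 0
}
```

After sourcing the file, run show_ecc to print the report and disable_ecc (as root) to request the change. Until the machine is rebooted, the report is expected to list the old mode as current and the new one as pending.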

Any advice, suggestion or solution will be highly appreciated.

-Konstantin


Solution

  • I wonder whether this is some sort of compatibility issue rather than a bad card. I have the same problem with a Tesla C2075 on the same Ubuntu version. We contacted NVIDIA, and they told us that double-bit ECC errors (as reported by nvidia-smi -q on Linux) meant that the card was probably broken. We obtained a replacement, but it has exactly the same issue.

    It seems unlikely that both boards I have had are broken in the same way, so we're going to try the card in another machine if we can find a suitable one.

    I'll post anything interesting that we learn.
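To compare two cards or two machines, the double-bit counters mentioned above can be pulled out of the nvidia-smi report. This is a rough sketch only: the function name and the context size are made up for illustration, the exact wording and layout of the report differ between driver versions, and grep -A assumes GNU or BSD grep.

```shell
#!/bin/sh
# Sketch: keep only the "Double Bit" parts of an nvidia-smi report read from stdin.
# double_bit_lines is an illustrative helper, not an nvidia-smi feature.
double_bit_lines() {
    # $1 = number of lines to show after each match (default 6).
    context="${1:-6}"
    grep -i -A "$context" 'double bit'
}
```

A possible use is nvidia-smi -q -d ECC | double_bit_lines, run once on each machine. Nonzero counts that show up with the same card in a second host would point toward the card, while counts that stay at zero there would point toward the original system.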