Consumer-grade Nvidia GPUs are expected to have about 1-10 soft memory errors per week.
If you somehow manage to detect an error on a system without ECC (e.g. because the results were abnormal), what steps are necessary and sufficient to recover from it?
Is it enough to just reload all of the data to the GPU (cuda.memcpy_htod in PyCuda), or do you need to reboot the system? What about the "kernel", rather than the data?
A soft memory error (meaning incorrect results due to noise of some kind) shouldn't require a reboot. Just rewind to some known good position, reload the data to the GPU, and proceed.
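The rewind-and-reload idea can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function `run_with_rollback` and its parameters (`step`, `upload`, `looks_valid`, `checkpoint_every`, `max_retries`) are hypothetical names introduced here, and it assumes you keep a host-side copy of the last known good state and have some way to tell that a result looks wrong.

```python
import copy


def run_with_rollback(state, step, upload, looks_valid, n_steps,
                      checkpoint_every=10, max_retries=3):
    """Run n_steps iterations, rolling back to the last host-side
    checkpoint whenever a result fails the validity check.

    All callables are assumptions supplied by the caller:
      step(state)        -> new host-side state (e.g. launch a kernel and
                            copy the result back to the host)
      upload(state)      -> copy host-side state to the GPU
      looks_valid(state) -> True if the result passes a sanity check
    """
    # Keep a known good copy on the host, where the GPU cannot corrupt it.
    checkpoint = (0, copy.deepcopy(state))
    upload(state)

    i = 0
    retries = 0
    while i < n_steps:
        new_state = step(state)
        if not looks_valid(new_state):
            # Suspected memory error: rewind and reload instead of rebooting.
            if retries >= max_retries:
                raise RuntimeError("repeated failures; probably not a soft error")
            retries += 1
            i, saved = checkpoint
            state = copy.deepcopy(saved)
            upload(state)
            continue

        state = new_state
        i += 1
        if i % checkpoint_every == 0:
            # Progress confirmed: take a new checkpoint and reset the retry count.
            checkpoint = (i, copy.deepcopy(state))
            retries = 0
    return state
```

With PyCUDA, `upload` would typically wrap a call such as `cuda.memcpy_htod(dev_ptr, host_array)`, where `dev_ptr` and `host_array` stand in for your own device allocation and host buffer. The retry limit is a design choice: a transient error should disappear after a reload, so repeated failures at the same point suggest a different problem.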