How to Properly Recover from Memory Errors on a GPU?

Consumer-grade Nvidia GPUs are expected to have about 1-10 soft memory errors per week.

If you somehow manage to detect an error on a system without ECC (e.g. because the results were abnormal), what steps are necessary and sufficient to recover from it?

Is it enough to just reload all of the data to the GPU (cuda.memcpy_htod in PyCUDA), or do you need to reboot the system? And what about the "kernel", as opposed to the data?

Solution

A soft memory error (meaning incorrect results due to noise of some kind) shouldn't require a reboot. Just rewind to a known good state, reload the data to the GPU, and proceed.
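The rewind-and-reload idea can be sketched roughly as below. This is only a minimal illustration, not a definitive recipe: `run_with_rollback`, `step`, `is_valid`, and `upload` are hypothetical names introduced here. It assumes you keep a known good copy of the data on the host, and that `upload` wraps whatever transfer you use (for example `cuda.memcpy_htod` in PyCUDA). The GPU calls are passed in as plain callables so the control flow can be read and tried without a device.

```python
import copy


def run_with_rollback(state, step, is_valid, upload, n_steps, max_retries=3):
    """Run n_steps, rewinding to the last known good host copy on a bad result.

    All four callables/arguments are hypothetical placeholders:
      state    -- host-side data known to be good (e.g. a NumPy array)
      step     -- callable(state) -> new state; launches the GPU work
      is_valid -- callable(state) -> bool; the application's own sanity check
      upload   -- callable(state); re-sends host data to the device,
                  e.g. a thin wrapper around pycuda.driver.memcpy_htod
    """
    checkpoint = copy.deepcopy(state)  # known good position
    upload(checkpoint)                 # initial transfer to the GPU
    done = 0
    retries = 0
    while done < n_steps:
        # Work on a copy so a bad step can never damage the checkpoint.
        candidate = step(copy.deepcopy(checkpoint))
        if is_valid(candidate):
            checkpoint = candidate     # advance the known good position
            done += 1
            retries = 0
        else:
            retries += 1
            if retries > max_retries:
                # Assumption: repeated failures suggest something other
                # than a transient (soft) error.
                raise RuntimeError("step keeps failing after reload")
            upload(checkpoint)         # rewind: reload good data to the GPU
    return checkpoint
```

How often to checkpoint is a trade-off: the sketch checkpoints after every step for simplicity, whereas a real workload would more likely do so every N steps to limit the device-to-host copies.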
