Tags: cuda, nvidia, nsight, amd-processor

Running CUDA GUI samples from a passive (inactive) GPU


I managed to successfully run CUDA programs on a GeForce GTX 750 Ti while using an AMD Radeon HD 7900 as the rendering device (the card actually connected to the display) using this guide; for instance, the Vector Addition sample runs nicely. However, I can only run applications that produce no visual output. For example, the Mandelbrot CUDA sample fails to run with an error:

Error: failed to get minimal extensions for demo:
  Missing support for:  GL_ARB_pixel_buffer_object
This sample requires:
  OpenGL version 1.5
  GL_ARB_vertex_buffer_object
  GL_ARB_pixel_buffer_object

The error originates from a glewIsSupported() query for these extensions. Is there any way to run an application like these CUDA samples so that the CUDA operations run on the GTX as usual, but the window is drawn on the Radeon card? I tried to convince Nsight Eclipse to run a remote debugging session with my own PC as the remote host, but something else failed right away. Is that supposed to work at all? Could it be possible to use VirtualGL?
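
For context, the check that fails is presumably something like the following (a minimal sketch, assuming GLEW has already been initialized with glewInit() on a current GL context; the exact sample code may differ):

    #include <GL/glew.h>
    #include <stdio.h>

    /* Returns 1 if the extensions the Mandelbrot sample asks for are
       available on the GL context that is current when this is called. */
    int hasRequiredExtensions(void)
    {
        if (!glewIsSupported("GL_VERSION_1_5 "
                             "GL_ARB_vertex_buffer_object "
                             "GL_ARB_pixel_buffer_object")) {
            fprintf(stderr, "Missing support for required extensions\n");
            return 0;
        }
        return 1;
    }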


Solution

  • Some of the NVIDIA CUDA samples that involve graphics, such as the Mandelbrot sample, implement an efficient rendering strategy: they bind OpenGL data structures - Pixel Buffer Objects (PBOs) in the case of Mandelbrot - to the CUDA arrays containing the simulation data and render them directly from the GPU. This avoids copying the data from the device to the host at the end of each iteration of the simulation, and results in a lightning-fast rendering phase (the first sketch at the end of this answer illustrates the interop pattern).

    To answer your question: the NVIDIA samples, as they stand, need to run the rendering phase on the same GPU where the simulation phase is executed; otherwise, the GPU that handles the graphics would not have the data to be rendered in its memory.

    This does not mean the samples cannot be modified to work with multiple GPUs. It should be possible to copy the simulation data back to the host at the end of each iteration, and then render it using a custom method or even send it over the network; the second sketch below shows the host round trip. This would require (1) modifying the code to separate the simulation and rendering phases and make them independent of each other, and (2) accepting the large drop in frames per second that would result.
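
    First, a minimal sketch of the single-GPU interop pattern described above, using the CUDA runtime's OpenGL interop API; the kernel name is hypothetical and the actual sample code differs in detail:

        #include <GL/glew.h>
        #include <cuda_gl_interop.h>

        GLuint pbo;                        /* pixel buffer object, created with
                                              glGenBuffers/glBufferData */
        struct cudaGraphicsResource *res;  /* CUDA handle for the PBO */

        void registerPbo(void)
        {
            /* Register the GL buffer with CUDA once, at startup. */
            cudaGraphicsGLRegisterBuffer(&res, pbo,
                                         cudaGraphicsMapFlagsWriteDiscard);
        }

        void renderFrame(int width, int height)
        {
            uchar4 *devPtr;
            size_t  size;

            /* Map the PBO and obtain a device pointer into it. */
            cudaGraphicsMapResources(1, &res, 0);
            cudaGraphicsResourceGetMappedPointer((void **)&devPtr, &size, res);

            /* The kernel writes the image straight into GL-owned memory;
               runMandelbrot is a hypothetical stand-in for the sample kernel. */
            /* runMandelbrot<<<grid, block>>>(devPtr, width, height); */

            cudaGraphicsUnmapResources(1, &res, 0);

            /* glDrawPixels or a texture upload then reads from the PBO,
               with no device-to-host copy involved. */
        }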
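
    Second, a sketch of the host round trip a multi-GPU version would need: the simulating GPU copies the frame to host memory, and the rendering GPU's GL context re-uploads it every iteration. Names are illustrative, and the texture is assumed to have been allocated beforehand with glTexImage2D:

        #include <GL/glew.h>
        #include <cuda_runtime.h>
        #include <stdlib.h>

        void renderViaHost(const uchar4 *devImage, GLuint tex,
                           int width, int height)
        {
            size_t  bytes = (size_t)width * height * sizeof(uchar4);
            uchar4 *host  = (uchar4 *)malloc(bytes);

            /* 1. Device-to-host copy from the simulating GPU: this is
               the step that costs the frames per second noted above. */
            cudaMemcpy(host, devImage, bytes, cudaMemcpyDeviceToHost);

            /* 2. Re-upload into a texture owned by the rendering GPU's
               GL context, then draw a textured quad as usual. */
            glBindTexture(GL_TEXTURE_2D, tex);
            glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                            GL_RGBA, GL_UNSIGNED_BYTE, host);

            free(host);
        }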