Tags: python, cuda, pycuda

Pycuda: Best way of calling Kernel multiple times


I'm using pycuda to make a relativistic raytracer. Basically, for each "pixel" in a big 2D array we must solve a system of 6 ODEs using Runge-Kutta. As each integration is independent of the rest, this should be very easy to parallelize. Other people have achieved it using C/C++ CUDA with excellent results (see this project).

The problem is that I don't know the best way to do this. I'm writing a kernel that performs some Runge-Kutta steps and then returns the results to the CPU; this kernel is called many times until the whole ray is integrated. For some reason it is very slow. Of course, I know that memory transfers are a real bottleneck in CUDA, but since this is so slow, I'm starting to think that I'm doing something wrong.

It would be great if you could recommend the best programming practices for this case (using pycuda). Some things I'm wondering:

  1. Do I need to create a new context on each kernel call?
  2. Is there a way to avoid transferring memory from GPU to CPU? That is, starting a kernel, pausing it to get some information, restarting it, and repeating.
  3. Each RK4 iteration takes roughly half a second, which is insane (compared also with the CUDA code in the link, which does a similar operation). I think this is due to something wrong in the way I'm using pycuda, so if you can explain the best way to do such an operation, that would be great!

To clarify: the reason I have to pause/restart the kernel is the watchdog. Kernels that run for more than 10 seconds get killed by the watchdog.

Thank you in advance!


Solution

  • Your main question seems too general, and it's hard to give concrete advice without seeing the code. I'll try to answer your subquestions (not an actual answer, but it's a bit long for a comment).

    Do I need to create a new context on each kernel call?

    No.
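
    As a quick illustration (a minimal sketch assuming the usual
    pycuda.autoinit setup; the kernel is a trivial placeholder): the
    context is created once at import time, the module is compiled once,
    and the same function handle can then be launched as many times as
    you like:

        import numpy as np
        import pycuda.autoinit                  # creates one context, once
        import pycuda.gpuarray as gpuarray
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void bump(float *x) { x[threadIdx.x] += 1.0f; }
        """)                                    # compiled once, up front
        bump = mod.get_function("bump")

        x = gpuarray.zeros(256, dtype=np.float32)
        for _ in range(1000):                   # 1000 launches, same context
            bump(x, block=(256, 1, 1), grid=(1, 1))
        print(x.get()[0])                       # -> 1000.0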

    Is there a way to avoid transferring memory from GPU to CPU? That is, starting a kernel, pausing it to get some information, restarting it, and repeating.

    It depends on what you mean by "get some information". If it means doing something with it on the CPU, then, of course, you have to transfer it. If you just want to use it in another kernel invocation, then you don't need to transfer it.
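
    For example, here is a sketch of the second case (the toy kernel
    below stands in for the real RK4 stepper, integrating y' = -y): the
    state is uploaded once, every launch continues from the
    device-resident array, and only the final result is copied back.
    Splitting the integration into many short launches like this also
    keeps each launch under the watchdog limit:

        import numpy as np
        import pycuda.autoinit
        import pycuda.gpuarray as gpuarray
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void rk4_steps(double *y, int nsteps, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            double h = 1e-3, v = y[i];
            for (int s = 0; s < nsteps; ++s) {   // classic RK4 for y' = -y
                double k1 = -v;
                double k2 = -(v + 0.5 * h * k1);
                double k3 = -(v + 0.5 * h * k2);
                double k4 = -(v + h * k3);
                v += h / 6.0 * (k1 + 2*k2 + 2*k3 + k4);
            }
            y[i] = v;
        }
        """)
        rk4_steps = mod.get_function("rk4_steps")

        n = 1 << 20
        y_gpu = gpuarray.to_gpu(np.ones(n, dtype=np.float64))  # upload once

        for chunk in range(100):    # 100 short launches, no copies between
            rk4_steps(y_gpu, np.int32(50), np.int32(n),
                      block=(256, 1, 1), grid=((n + 255) // 256, 1))

        result = y_gpu.get()        # download once, at the end

    The arithmetic is identical to a version that copies the state back
    after every chunk, but without the PCIe round trips, which typically
    dominate in that pattern.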

    Each RK4 iteration takes roughly half a second, which is insane (compared also with the CUDA code in the link, which does a similar operation).

    It really depends on the equation, the number of threads, and the video card you are using. I can imagine a situation in which one RK step would take that long.
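
    One thing worth checking (a sketch using CUDA events through the
    PyCUDA driver API; the kernel is again a placeholder for your
    stepper) is whether the half second is actually spent inside the
    kernel, or in the transfers and module compilation around it:

        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as drv
        import pycuda.gpuarray as gpuarray
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void work(float *x, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                for (int s = 0; s < 1000; ++s)
                    x[i] = x[i] * 0.999f + 1e-3f;
        }
        """)
        work = mod.get_function("work")

        n = 1 << 20
        x = gpuarray.zeros(n, dtype=np.float32)

        start, end = drv.Event(), drv.Event()
        start.record()
        work(x, np.int32(n), block=(256, 1, 1), grid=((n + 255) // 256, 1))
        end.record()
        end.synchronize()           # wait for the kernel to finish
        print("kernel alone: %.3f ms" % start.time_till(end))

    If the event timing is small but each Python-level iteration is still
    slow, the time is going into transfers, compilation, or launch
    overhead rather than the integration itself.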

    I think this is due to something wrong in the way I'm using pycuda, so if you can explain the best way to do such an operation, that would be great!

    Impossible to say for sure without the code. Try to create a minimal example that demonstrates the problem, or at the very least post a link to a runnable (even if rather long) piece of code that illustrates it. As for PyCUDA, it's a very thin wrapper over CUDA, and all the programming practices that apply to the latter apply to the former as well.