Automatic CudaMat conversion in Python

I'm looking into speeding up my python code, which is all matrix math, using some form of CUDA. Currently my code is using Python and Numpy, so it seems like it shouldn't be too difficult to rewrite it using something like either PyCUDA or CudaMat.

However, on my first attempt using CudaMat, I realized I had to rearrange a lot of the equations in order to keep the operations all on the GPU. This included the creation of many temporary variables so I could store the results of the operations.

I understand why this is necessary, but it makes what were once easy to read equations into somewhat of a mess that difficult to inspect for correctness. Additionally, I would like to be able to easily modify the equations later on, which isn't in their converted form.

The package Theano manages to do this by first creating a symbolic representation of the operations, then compiling them to CUDA. However, after trying Theano out for a bit, I was frustrated by how opaque everything was. For example, just getting the actual value for myvar.shape[0] is made difficult since the tree doesn't get evaluated until much later. I would also much prefer less of a framework in which my code much conform to a library that acts invisibly in the place of Numpy.

Thus, what I would really like is something much simpler. I don't want automatic differentiation (there are other packages like OpenOpt that can do that if I require it), or optimization of the tree, but just a conversion from standard Numpy notation to CudaMat/PyCUDA/somethingCUDA. In fact, I want to be able to have it evaluate to just Numpy without any CUDA code for testing.

I'm currently considering writing this myself, but before even consider such a venture, I wanted to see if anyone else knows of similar projects or a good starting place. The only other project I know that might be close to this is SymPy, but I don't know how easy it would be to adapt to this purpose.

My current idea would be to create an array class that looked like a Numpy.array class. It's only function would be to build a tree. At any time, that symbolic array class could be converted to a Numpy array class and be evaluated (there would also be a one-to-one parity). Alternatively, the array class could be traversed and have CudaMat commands be generated. If optimizations are required they can be done at that stage (e.g. re-ordering of operations, creation of temporary variables, etc.) without getting in the way of inspecting what's going on.

Any thoughts/comments/etc. on this would be greatly appreciated!

Update

A usage case may look something like (where sym is the theoretical module), where we might be doing something such as calculating the gradient:

W = sym.array(np.rand(size=(numVisible, numHidden)))
delta_o = -(x - z)
delta_h = sym.dot(delta_o, W)*h*(1.0-h)
grad_W = sym.dot(X.T, delta_h)

In this case, grad_W would actually just be a tree containing the operations that needed to be done. If you wanted to evaluate the expression normally (i.e. via Numpy) you could do:

npGrad_W = grad_W.asNumpy()

which would just execute the Numpy commands that the tree represents. If on the other hand, you wanted to use CUDA, you would do:

cudaGrad_W = grad_W.asCUDA()

which would convert the tree into expressions that can executed via CUDA (this could happen in a couple of different ways).

That way it should be trivial to: (1) test grad_W.asNumpy() == grad_W.asCUDA(), and (2) convert your pre-existing code to use CUDA.

Solution

Have you looked at the GPUArray portion of PyCUDA?

http://documen.tician.de/pycuda/array.html

While I haven't used it myself, it seems like it would be what you're looking for. In particular, check out the "Single-pass Custom Expression Evaluation" section near the bottom of that page.