How can I use the GPU to speed up PyMC3 sampling?

I've used the 'njobs' parameter to sample multiple chains, and the speed is far from what I expected.
I've edited the '.theanorc' file to set 'floatX', 'cnmem', etc.
I've monitored GPU usage with the 'nvidia-smi' command, and the GPU is being well utilized.

But the sampling is still slow, even slower than on the CPU.
Is that normal?

Solution

This sounds like a convergence or model-construction problem rather than anything related to njobs or parallelism. Without the model or traces, there is not a lot that can be said here.

GPU support is still experimental, and we've seen speed-ups for some models and slow-downs for others. ADVI seems to be easier to run on the GPU, though. You can also check that all the dtypes in your model and your input data are float32.
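
As a rough illustration of the float32 check, here is a minimal sketch that uses only NumPy. The helper names (`as_float32`, `non_float32`) and the arrays `X` and `y` are made up for this example and are not part of PyMC3 or Theano. NumPy creates float64 arrays by default, so data can end up as float64 even when 'floatX' is set to float32 in '.theanorc'.

```python
import numpy as np

def as_float32(data):
    """Return `data` as a float32 NumPy array (copies only if a cast is needed)."""
    return np.asarray(data, dtype=np.float32)

def non_float32(named_arrays):
    """Return the names of the arrays whose dtype is not float32."""
    return [name for name, arr in named_arrays.items()
            if np.asarray(arr).dtype != np.float32]

# Hypothetical observed data; np.random.randn returns float64 arrays.
X = np.random.randn(100, 3)
y = np.random.randn(100)

inputs = {"X": X, "y": y}
before = non_float32(inputs)   # names that still need casting

# Cast everything once, before the data is handed to the model.
inputs = {name: as_float32(arr) for name, arr in inputs.items()}
after = non_float32(inputs)    # should be empty after the cast

print("not float32 before cast:", before)
print("not float32 after cast:", after)
```

Casting the data once, before it goes into the model, is presumably cheaper than letting mixed float32/float64 expressions be upcast inside the graph, but whether it changes the sampling speed depends on the model.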