I tried to run PTX assembly generated from a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure):
1) Load the .cl kernel
2) JIT-compile it
3) Retrieve the compiled PTX code and save it.
So far so good.
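For reference, here is a minimal sketch of steps 1-3, assuming a single NVIDIA device and an already-created context (error checking omitted):

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Build the kernel source and save the JIT output. On NVIDIA platforms,
 * CL_PROGRAM_BINARIES returns the compiled PTX as text. */
void save_ptx(cl_context ctx, cl_device_id dev, const char *source)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);   /* JIT compile */

    size_t size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *ptx = malloc(size);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptx), &ptx, NULL);

    FILE *f = fopen("kernel.ptx", "wb");
    fwrite(ptx, 1, size, f);
    fclose(f);
    free(ptx);
    clReleaseProgram(prog);
}
```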
I noticed some special registers inside the PTX assembly: %envreg3, %envreg6, etc. The problem is that these registers are not set when I try to execute the code with the driver API (according to the PTX ISA, they are supposed to be set by the driver before the kernel launch). As a result, the code falls into an infinite loop and fails to run correctly. But if I set the values manually (more precisely, I replace %envreg6 with the block size inside the PTX), the code executes and I get the correct results (correct compared with the CPU results).
Does anyone know how we can set values for these registers, or am I missing something, e.g. a flag on cuLaunchKernel that sets them?
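For reference, my driver-API path for loading and launching the saved PTX looks roughly like this (the kernel name, launch geometry, and params array are placeholders; error checking omitted). Nothing in this call chain takes a value for the %envreg registers:

```c
#include <cuda.h>

/* Sketch of the CUDA driver API launch path for the saved PTX text. */
void launch_ptx(const char *ptx_text, void **params)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoadData(&mod, ptx_text);            /* JIT the PTX */
    cuModuleGetFunction(&fn, mod, "my_kernel");  /* placeholder name */

    /* 1 block of 256 threads; no flag or argument here sets %envregN */
    cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1, 0, NULL, params, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```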
You are trying to compile an OpenCL kernel and run it using the CUDA driver API. The NVIDIA driver/compiler interface is different between OpenCL and CUDA, so what you want to do is not supported and fundamentally cannot work.
Presumably, the only workaround would be the one you found: to patch the PTX code. But I'm afraid this might not work in the general case.
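If you do go down that road, the patch amounts to textual substitution on the PTX before cuModuleLoadData. A sketch (the helper name is mine; it assumes the kernel only ever reads %envreg6 and will break for kernels that depend on the other environment registers):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Replace every occurrence of "%envreg6" in the PTX text with an
 * immediate block-size string before handing it to cuModuleLoadData().
 * Fragile: valid only when an immediate is legal wherever %envreg6 appears. */
char *patch_envreg6(const char *ptx, const char *blocksize)
{
    const char *needle = "%envreg6";
    size_t cap = strlen(ptx) * 2 + 1;  /* generous bound: replacement is short */
    char *out = malloc(cap);
    char *dst = out;

    while (*ptx) {
        if (strncmp(ptx, needle, strlen(needle)) == 0) {
            dst += sprintf(dst, "%s", blocksize);
            ptx += strlen(needle);
        } else {
            *dst++ = *ptx++;
        }
    }
    *dst = '\0';
    return out;  /* caller frees */
}
```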
Edit: Specifically, OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also, in OpenCL, indices do not necessarily start from (0, 0, 0); the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
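To see why, consider that clEnqueueNDRangeKernel accepts a global work offset (since OpenCL 1.1), which the NVIDIA OpenCL driver must deliver to the kernel somehow; the environment registers are that channel. A CUDA driver API launch has no equivalent parameter. Sketch (queue, kernel, and sizes are placeholders):

```c
#include <CL/cl.h>

/* OpenCL lets get_global_id() start from a nonzero offset. The driver
 * forwards this offset (and any virtualized-grid offsets) to the kernel,
 * which is what the %envreg registers are for; a CUDA driver API launch
 * has no such mechanism, so those registers stay uninitialized. */
void enqueue_with_offset(cl_command_queue queue, cl_kernel kernel)
{
    size_t offset[1] = { 1024 };   /* get_global_id(0) starts at 1024 */
    size_t global[1] = { 4096 };
    size_t local[1]  = { 256 };

    clEnqueueNDRangeKernel(queue, kernel, 1, offset, global, local,
                           0, NULL, NULL);
}
```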