
Loading multiple modules in JCuda is not working


In JCuda one can load CUDA code in PTX or CUBIN format and call (launch) __global__ functions (kernels) from Java.

With that in mind, I want to develop a framework with JCuda that receives a user's __device__ function in a .cu file at run time, loads it, and runs it. I have already implemented a __global__ function in which each thread finds the start of its portion of the data, performs some initialization and computation, and then calls the user's __device__ function.

Here is my kernel pseudo code:

extern "C" __device__ void userFunc(args);
extern "C" __global__ void kernel(){

    // initialize

    userFunc(args);

    // rest of the kernel
}

And user's __device__ function:

extern "C" __device__ void userFunc(args){
    // do something
}

On the Java side, here is the part where I load the modules (the modules are built from PTX files, which are successfully generated from the CUDA files with this command: nvcc -m64 -ptx path/to/cudaFile -o cudaFile.ptx):

CUmodule kernelModule = new CUmodule(); // 1 
CUmodule userFuncModule = new CUmodule(); // 2
cuModuleLoad(kernelModule, ptxKernelFileName); // 3 
cuModuleLoad(userFuncModule, ptxUserFuncFileName); // 4

When I try to run it, I get an error at line 3: CUDA_ERROR_NO_BINARY_FOR_GPU. After some searching, I found suggestions that my PTX file might be malformed, so I ran this suggested command:

ptxas -arch=sm_30 kernel.ptx

I got:

ptxas fatal : Unresolved extern function 'userFunc'

Even when I swap lines 3 and 4 to load userFunc before the kernel, I get the same error. I am stuck at this point. Is this the correct way to load multiple modules that need to be linked together in JCuda? Or is it even possible?

Edit:

Second part of the question is here


Solution

  • The really short answer is: no, you can't just load multiple modules that reference each other's symbols and expect the driver to resolve them for you.

    You can do what you want, but it requires explicit setup and execution of a JIT linking pass. I have no idea how (or even whether) that has been implemented in JCuda, but I can show you how to do it with the standard driver API. Hold on...

    If you have a device function in one file, and a kernel in another, for example:

    // test_function.cu
    #include <math.h>
    __device__ float mathop(float &x, float &y, float &z)
    {
            float res = sin(x) + cos(y) + sqrt(z);
            return res;
    }
    

    and

    // test_kernel.cu
    extern __device__ float mathop(float & x, float & y, float & z);
    
    __global__ void kernel(float *xvals, float * yvals, float * zvals, float *res)
    {
    
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
    
            res[tid] = mathop(xvals[tid], yvals[tid], zvals[tid]);
    }
    

    You can compile them to PTX as usual:

    $ nvcc -arch=sm_30 -ptx test_function.cu
    $ nvcc -arch=sm_30 -ptx test_kernel.cu
    $ head -14 test_kernel.ptx
    //
    // Generated by NVIDIA NVVM Compiler
    //
    // Compiler Build ID: CL-19324607
    // Cuda compilation tools, release 7.0, V7.0.27
    // Based on LLVM 3.4svn
    //
    
    .version 4.2
    .target sm_30
    .address_size 64
    
            // .globl       _Z6kernelPfS_S_S_
    .extern .func  (.param .b32 func_retval0) _Z6mathopRfS_S_
    

    At runtime, your code must create a JIT link session, add each PTX to the linker session, then finalise the linker session. Note the .extern .func declaration for _Z6mathopRfS_S_ in the PTX above: that is the same kind of unresolved reference ptxas complained about in your case, and it is what the link step resolves. Finalising gives you a handle to a compiled cubin image which can be loaded as a module as usual. The simplest possible driver API code to put this together looks like this:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda.h>
    
    #define drvErrChk(ans) { drvAssert(ans, __FILE__, __LINE__); }
    
    inline void drvAssert(CUresult code, const char *file, int line, bool abort=true)
    {
        if (code != CUDA_SUCCESS) {
            fprintf(stderr, "Driver API Error %04d at %s %d\n", int(code), file, line);
            exit(-1);
        }
    }
    
    int main()
    {
        drvErrChk( cuInit(0) );
    
        CUdevice device;
        drvErrChk( cuDeviceGet(&device, 0) );
    
        CUcontext context;
        drvErrChk( cuCtxCreate(&context, 0, device) );
    
        // Create a JIT link session and add each PTX file to it
        CUlinkState state;
        drvErrChk( cuLinkCreate(0, 0, 0, &state) );
        drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_function.ptx", 0, 0, 0) );
        drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_kernel.ptx", 0, 0, 0) );
    
        // Finalise the link session; this yields a fully linked cubin image
        size_t sz;
        void * image;
        drvErrChk( cuLinkComplete(state, &image, &sz) );
    
        // Load the linked image as a module. cuModuleLoadData copies the
        // image, so the link state can be destroyed immediately afterwards
        CUmodule module;
        drvErrChk( cuModuleLoadData(&module, image) );
    
        drvErrChk( cuLinkDestroy(state) );
    
        // Fetch the kernel via its C++ mangled name
        CUfunction function;
        drvErrChk( cuModuleGetFunction(&function, module, "_Z6kernelPfS_S_S_") );
    
        return 0;
    }
    

    You should be able to compile and run this as posted and verify that it works. It should serve as a template for a JCuda implementation, if JCuda has JIT linking support implemented.
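
    As it happens, JCuda mirrors the driver API quite closely, so a Java version of the same flow should be possible if your JCuda version exposes the cuLink* entry points on JCudaDriver. What follows is a minimal sketch under that assumption; the exact signatures (in particular the JITOptions parameter and the cuModuleLoadDataEx call) should be verified against the JCuda release you are using:

    import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_PTX;
    import static jcuda.driver.JCudaDriver.*;
    
    import jcuda.Pointer;
    import jcuda.driver.*;
    
    public class JitLinkExample
    {
        public static void main(String[] args)
        {
            // Standard driver API initialization, as in the C++ version
            cuInit(0);
            CUdevice device = new CUdevice();
            cuDeviceGet(device, 0);
            CUcontext context = new CUcontext();
            cuCtxCreate(context, 0, device);
    
            // Create a JIT link session and add both PTX files.
            // JITOptions is assumed to be JCuda's wrapper for the driver's
            // JIT option arrays; an empty instance means default options.
            JITOptions jitOptions = new JITOptions();
            CUlinkState linkState = new CUlinkState();
            cuLinkCreate(jitOptions, linkState);
            cuLinkAddFile(linkState, CU_JIT_INPUT_PTX, "test_function.ptx", jitOptions);
            cuLinkAddFile(linkState, CU_JIT_INPUT_PTX, "test_kernel.ptx", jitOptions);
    
            // Finalise the link session to obtain the linked cubin image
            long size[] = { 0 };
            Pointer image = new Pointer();
            cuLinkComplete(linkState, image, size);
    
            // Load the linked image as a single module; after that the
            // link state can be destroyed
            CUmodule module = new CUmodule();
            cuModuleLoadDataEx(module, image, 0, new int[0], Pointer.to(new int[0]));
            cuLinkDestroy(linkState);
    
            // Fetch the kernel by its mangled name, ready to launch
            CUfunction function = new CUfunction();
            cuModuleGetFunction(function, module, "_Z6kernelPfS_S_S_");
        }
    }

    If this works, the rest (memory allocation, cuLaunchKernel, and so on) is unchanged from a normal single-module JCuda program.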