Tags: memory, cuda, mex

Is it possible to pre-allocate a variable to CPU/GPU memory in the MexGateway code written in Visual Studio?


I'm trying to write a MEX gateway that passes two MATLAB variables to the compiled MEX file, copies them to a CUDA kernel, does the processing, and brings the results back to MATLAB. I need to call this MEX file inside a for loop in MATLAB.

The problem is that the two inputs are huge for my application, and only one of them (called Device_Data in the following code) changes on each iteration. So I'm looking for a way to pre-allocate the stable input so that it is not removed from the GPU at each iteration of my for loop. I also need to say that I really need to do this in my Visual Studio code and make it happen inside the MEX gateway code (I do not want to do it in MATLAB). Is there any solution for this?

Here is my code (I have already compiled it. It works fine):

    #include <cuda_runtime.h>
    #include "device_launch_parameters.h"
    #include <stdio.h>
    #include "cuda.h"
    #include <iostream>
    #include <mex.h>
    #include "MexFunctions.cuh"




    __global__ void add(int* Device_Data, int* Device_MediumX, int N) {
        int TID = threadIdx.y * blockDim.x + threadIdx.x;
        if (TID < N) {
            // each thread updates exactly one element (no per-thread loop, no race between threads)
            Device_Data[TID] = Device_Data[TID] + Device_MediumX[TID];
        }
    }

    void mexFunction(int nlhs, mxArray* plhs[],
                     int nrhs, const mxArray* prhs[]) {

        int N = 128;
        int* MediumX;
        int* Data;
        int* Data_New;

        // inputs are assumed to be int32 arrays, so use mxGetData rather than mxGetPr (which is for double data)
        MediumX = (int*)mxGetData(prhs[0]);
        Data = (int*)mxGetData(prhs[1]);

        plhs[0] = mxCreateNumericMatrix(N, 1, mxINT32_CLASS, mxREAL);
        Data_New = (int*)mxGetData(plhs[0]);

        int ArrayByteSize = sizeof(int) * N;

        int* Device_MediumX; // device pointer to the X coordinates of the medium (the stable input)
        gpuErrchk(cudaMalloc((int**)&Device_MediumX, ArrayByteSize));
        gpuErrchk(cudaMemcpy(Device_MediumX, MediumX, ArrayByteSize, cudaMemcpyHostToDevice));

        int* Device_Data; // device pointer to the data that changes on every call
        gpuErrchk(cudaMalloc((int**)&Device_Data, ArrayByteSize));
        gpuErrchk(cudaMemcpy(Device_Data, Data, ArrayByteSize, cudaMemcpyHostToDevice));

        dim3 block(N, 1);
        dim3 grid(1); // SystemSetup.NumberOfTransmitter
        add<<<grid, block>>>(Device_Data, Device_MediumX, N);

        gpuErrchk(cudaMemcpy(Data_New, Device_Data, ArrayByteSize, cudaMemcpyDeviceToHost));

        // tears down the CUDA context, so nothing allocated above survives to the next call
        cudaDeviceReset();
    }

Solution

  • Yes, it is possible, as long as you have MATLAB's Parallel Computing Toolbox (previously known as the Distributed Computing Toolbox).

    The toolbox lets you use gpuArray objects in normal MATLAB code, but it also has a C interface (the mxGPUArray API) through which you can get and set the GPU addresses of these MATLAB arrays.

    You can find the documentation here:

    https://uk.mathworks.com/help/parallel-computing/gpu-cuda-and-mex-programming.html?s_tid=CRUX_lftnav

    For example, for the first input to a MEX file:

    mxGPUArray const* dataHandler = mxGPUCreateFromMxArray(prhs[0]); // can be a CPU or GPU array; it is copied to the GPU if it is not already there
    float const* d_data = static_cast<float const*>(mxGPUGetDataReadOnly(dataHandler)); // get the raw device pointer itself (assuming float data)
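
    To connect this to the code in the question, the following is a minimal sketch (not a tested drop-in replacement) of what the whole mexFunction could look like when both inputs arrive as int32 gpuArrays, so that the large stable input never leaves the GPU between calls. The element-wise add kernel is carried over from the question, simplified to one element per thread; everything else (variable names, the 128-thread block size, the choice to return the result as a gpuArray) is only illustrative. It would be compiled with mexcuda and needs the gpu/mxGPUArray.h header that ships with the toolbox.

        #include "mex.h"
        #include "gpu/mxGPUArray.h"
        #include <cuda_runtime.h>

        __global__ void add(int* d_result, int const* d_mediumX, int N) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid < N) {
                d_result[tid] += d_mediumX[tid]; // element-wise add, one element per thread
            }
        }

        void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
            mxInitGPU(); // must run before any other mxGPU* call

            // Wrap the stable input. If prhs[0] is already a gpuArray, no copy is made here;
            // the data simply stays wherever MATLAB placed it on the device.
            mxGPUArray const* mediumX = mxGPUCreateFromMxArray(prhs[0]);

            // Make a writable GPU copy of the changing input; the kernel updates it in place.
            mxGPUArray* result = mxGPUCopyFromMxArray(prhs[1]);

            int N = (int)mxGPUGetNumberOfElements(result);
            int const* d_mediumX = static_cast<int const*>(mxGPUGetDataReadOnly(mediumX));
            int*       d_result  = static_cast<int*>(mxGPUGetData(result));

            int const threadsPerBlock = 128;
            int const blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
            add<<<blocks, threadsPerBlock>>>(d_result, d_mediumX, N);

            // Hand the result back to MATLAB as a gpuArray; it stays on the device
            // until the caller gathers it.
            plhs[0] = mxGPUCreateMxArrayOnGPU(result);

            // Destroy only the wrappers; the MATLAB-owned device memory is not freed here.
            mxGPUDestroyGPUArray(mediumX);
            mxGPUDestroyGPUArray(result);
        }

    On the MATLAB side, the stable input would then be moved to the device once before the loop (for example with gpuArray(int32(MediumX))), so each iteration only transfers the small changing input, and gather would be called on the returned gpuArray only when a host copy is actually needed.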