As the title says, I am getting "out of memory" errors (via standard CUDA error checking) after a cudaCreateTextureObject call, yet when I print the amount of free memory on the device, it is almost all free (11 GB out of 12 GB). Interestingly, this only arises after calling my CUDA code from Python/MATLAB ~34K times, with exactly the same inputs every time. My worry is that there is some other memory I am not aware of that is getting filled and never freed. Alternatively, the error is being caused/thrown by something else entirely, which leads to the question in the title.
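(For reference, cudaCheckErrors in the snippets below is the standard check-the-last-error helper, roughly along these lines; this is a sketch, the real macro differs in how it reports the error back to MATLAB/Python:)

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            /* Report which step failed plus the CUDA error string, then bail out. */ \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(__err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)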
I'd be happy with a good answer to the question in the title alone, but I think it's best if I give the whole context of my issue, which follows.
Bear with me while I share more detail. I have been unable to reproduce this outside of my complex piece of code; my attempts at a minimal example have been in vain. The question in the title stands on its own without code, since I can trigger the error with correct code and plenty of free global memory, but more context on the code may still help.
This out-of-memory error arises in my texture memory allocation. The function is exactly as follows:
void CreateTexture(const GpuIds& gpuids, float* projectiondata, Geometry geo, cudaArray** d_cuArrTex, unsigned int nangles, cudaTextureObject_t *texImage, cudaStream_t* stream, int nStreamDevice, bool allocate){
    const cudaExtent extent = make_cudaExtent(geo.nDetecU, geo.nDetecV, nangles);
    const unsigned int num_devices = gpuids.GetLength();
    size_t memfree;
    size_t memtotal;
    if (allocate){
        for (unsigned int dev = 0; dev < num_devices; dev++){
            cudaSetDevice(gpuids[dev]);
            cudaDeviceSynchronize();
            cudaCheckErrors("before cudaMalloc3DArray fail");
            //cudaArray Descriptor
            cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
            //cuda Array
            cudaMalloc3DArray(&d_cuArrTex[dev], &channelDesc, extent);
            cudaDeviceSynchronize();
            cudaCheckErrors("cudaMalloc3DArray fail");
        }
    }
    for (unsigned int dev = 0; dev < num_devices; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaMemcpy3DParms copyParams = {0};
        //Array creation
        copyParams.srcPtr = make_cudaPitchedPtr((void *)projectiondata, extent.width*sizeof(float), extent.width, extent.height);
        copyParams.dstArray = d_cuArrTex[dev];
        copyParams.extent = extent;
        copyParams.kind = cudaMemcpyHostToDevice;
        cudaMemcpy3DAsync(&copyParams, stream[dev*nStreamDevice+1]);
        cudaDeviceSynchronize();
        cudaCheckErrors("cudaMemcpy3DAsync fail");
    }
    //Array creation End
    for (unsigned int dev = 0; dev < num_devices; dev++){
        cudaSetDevice(gpuids[dev]);
        //cudaDeviceSynchronize();
        //cudaCheckErrors("cudaCreateTextureObject init fail");
        //cudaMemGetInfo(&memfree,&memtotal);
        //printf("Free memory: %zu\n",memfree);
        cudaResourceDesc texRes;
        memset(&texRes, 0, sizeof(cudaResourceDesc));
        texRes.resType = cudaResourceTypeArray;
        texRes.res.array.array = d_cuArrTex[dev];
        cudaTextureDesc texDescr;
        memset(&texDescr, 0, sizeof(cudaTextureDesc));
        texDescr.normalizedCoords = false;
        texDescr.filterMode = cudaFilterModeLinear;
        texDescr.addressMode[0] = cudaAddressModeBorder;
        texDescr.addressMode[1] = cudaAddressModeBorder;
        texDescr.addressMode[2] = cudaAddressModeBorder;
        texDescr.readMode = cudaReadModeElementType;
        cudaCreateTextureObject(&texImage[dev], &texRes, &texDescr, NULL);
        //cudaMemGetInfo(&memfree,&memtotal);
        //printf("Free memory: %zu\n",memfree);
        cudaDeviceSynchronize();
        cudaCheckErrors("cudaCreateTextureObject fail");
    }
}
Geometry is just a struct containing the metadata required for sizes etc. GpuIds is a class that lists which GPUs this code is allowed to use; it is not relevant for this bug, as the error happens on a single-GPU PC. Otherwise this is standard texture-object allocation, copy and creation code, with the only quirks being that it supports multiple GPUs (although the error has so far only been reproduced on single-GPU machines) and that a boolean allocate chooses whether the 3D array needs to be allocated or not.
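For orientation, the only Geometry fields this function touches are the detector dimensions used to build the extent; roughly the following (illustrative only, and with made-up field types; the real struct carries much more metadata):

struct Geometry {
    // Detector size in pixels (U and V directions); only these are used by CreateTexture.
    unsigned int nDetecU;
    unsigned int nDetecV;
    // ... many more fields describing the scan geometry ...
};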
I have used this code for years and it seems to run well. It lives in a function that is called via mex files or cython files from MATLAB or Python respectively, and it should allocate memory, do its compute, and completely free the GPU. Recently, someone noticed that after running this code in a loop for thousands of independent calls (the number is always the same for the same input size, but differs if the input size is changed), the code crashes with "out of memory", and I pinpointed that to the last cudaCheckErrors in the code above.
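For completeness, the matching cleanup at the end of each call destroys the texture objects and frees the arrays per device; in spirit it is something like this sketch (names follow CreateTexture above, but this is not the verbatim TIGRE code):

for (unsigned int dev = 0; dev < num_devices; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaDestroyTextureObject(texImage[dev]);  // release the texture object handle
    cudaFreeArray(d_cuArrTex[dev]);           // free the 3D cudaArray backing it
    cudaCheckErrors("texture cleanup fail");
}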
However, the way we are calling the code uses only 1 GB out of the 12 GB available, and monitoring all memory I cannot see an increase anywhere. This leads me to believe that the error is not necessarily an out-of-memory condition for global memory. I wonder if there is some specific array that I should be freeing and am not, or whether I am accidentally filling a different memory (shared? constant? I don't think either of those makes sense). Debugging this is harder, of course, and probably out of scope for this question, as it is a complex codebase. But I am now in the dark, as the only information I have ("out of memory") is not helpful with my current knowledge[^1].
Further evidence that I am somehow filling some memory that is not global memory is that if I call cudaDeviceReset() after each call, the error disappears, at the cost of longer execution time.
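(That workaround looks roughly like this, placed at the very end of the entry point that MATLAB/Python calls; the function name here is made up for illustration:)

#include <cuda_runtime.h>

// Resetting the device wipes the whole context, so nothing can accumulate across calls,
// but every subsequent call then pays full context re-initialisation.
extern "C" void tigre_call_finished(void){
    cudaDeviceSynchronize();  // make sure all work from this call has completed
    cudaDeviceReset();        // destroy the context: frees all allocations and handles
}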
Tested on 2 different machines, both erroring at exactly the same iteration number.
Install TIGRE. The file in question that errors is voxel_backprojection.cu, on lines 667-714 (the ones in the code above). You can reproduce this error with the following Python code:
import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm

def main():
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]
    domain = gt.shape
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)
    x = np.zeros(domain, dtype=np.float32)
    geo = tigre.geometry(mode="fan", nVoxel=np.array(x.shape))
    ys = tigre.Ax(gt, geo, angles)
    max_iterations = 1000 * 49  # Make it longer if the error doesn't happen on your GPU. Takes ~10 minutes on my machine to crash.
    print(ys.shape[0])
    for k in tqdm(range(max_iterations), leave=False):
        x = tigre.Atb(ys, geo, angles)

if __name__ == "__main__":
    main()
This fails on my machine at iteration 34679, every time.
Using CUDA V12.3.103 on an RTX 4070.
[^1]: And "out of memory" is the worst possible keyword to search for in CUDA, as all the hits are from people genuinely allocating too much memory!
The usual scenario for this would be memory fragmentation, i.e. even though you have free memory, the memory manager can’t find or create a contiguous slab of memory large enough to satisfy your request.
Having said that, this part:
This fails on my machine at iteration 34679, every time.
makes me suspicious that there is something else going on, like running out of resource handles, some counter overflowing, or something else within the runtime memory manager. I would absolutely report it as a bug to NVIDIA just to be sure.
The countermeasure in either scenario is not to thrash the memory manager.
Don’t repeatedly allocate and free memory. Either allocate a memory buffer in an initialisation step at the beginning of your code and keep that allocation alive as runtime state for the lifetime of your application (there is plenty of evidence that CUFFT and CUBLAS do exactly this), or use the stream-ordered allocator and operate with a runtime-managed memory pool.
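For the second option, a minimal sketch of the stream-ordered allocator with a cached pool (assuming CUDA 11.2+ and device 0; the names are illustrative, and this allocates plain linear memory rather than the cudaArray used in the question):

#include <cuda_runtime.h>
#include <cstdint>

// Sketch: freed blocks stay cached in the runtime-managed pool, so a tight call loop
// stops hammering the real allocator.
void allocate_from_pool(size_t nbytes, cudaStream_t stream){
    // Raise the release threshold so the pool keeps freed memory instead of
    // returning it to the OS between calls.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    uint64_t threshold = UINT64_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    float* d_buf = nullptr;
    cudaMallocAsync(&d_buf, nbytes, stream);    // after warm-up, served from the pool
    cudaMemsetAsync(d_buf, 0, nbytes, stream);  // stand-in for the real work
    cudaFreeAsync(d_buf, stream);               // returned to the pool, not the OS
    cudaStreamSynchronize(stream);
}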
However you do it, you should find that memory reuse (a) alleviates your immediate problem and (b) improves the performance of your code.