c++ cuda thrust

CUDA Thrust memory allocation issue


I have Thrust code that loads a large array of data (2.4G) into memory, performs calculations whose results are stored on the host (~1.5G), frees the initial data, loads the results onto the device, performs further calculations on them, and finally reloads the initial data. The Thrust code looks like this:

thrust::host_vector<float> hostData;
// code that loads ~2.4G of data into hostData
thrust::device_vector<float> deviceData = hostData;
thrust::host_vector<float> hostResult;
// code that performs calculations on deviceData and copies the result to hostResult (~1.5G)
free<thrust::device_vector<float> >(deviceData);
thrust::device_vector<float> deviceResult = hostResult;
// code that performs calculations on deviceResult and also stores some results on the device
free<thrust::device_vector<float> >(deviceResult);
deviceData = hostData;  // this copy is where bad_alloc is thrown

Here, free is a function I defined as follows:

template<class T> void free(T &V) {
    V.clear();
    V.shrink_to_fit();
    size_t mem_tot;
    size_t mem_free;
    cudaMemGetInfo(&mem_free, &mem_tot);
    std::cout << "Free memory : " << mem_free << std::endl;
}

template void free<thrust::device_vector<int> >(thrust::device_vector<int>& V);
template void free<thrust::device_vector<float> >(
    thrust::device_vector<float>& V);

However, I get a 'thrust::system::detail::bad_alloc' error (what(): std::bad_alloc: out of memory) when copying hostData back to deviceData, even though cudaMemGetInfo reports ~6G of free memory on my device at that point. Here is the complete output from the free function:

Free memory : 6295650304
Free memory : 6063775744
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what():  std::bad_alloc: out of memory

This seems to indicate that the device is out of memory although plenty is free. Is this the right way to free memory for Thrust vectors? I should also note that the code works well for smaller data sizes (up to 1.5G).


Solution

  • It would be useful to see a complete, compilable reproducer. However, you're probably running into memory fragmentation.

    Even though a large amount of memory may be reported as being free, it's possible that it can't be allocated in a single large contiguous chunk. This fragmentation will then limit the maximum size of a single allocation that you can request.
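
To make the distinction between "total free" and "largest contiguous free block" concrete, here is a minimal toy model in plain C++. It is only an illustration: `ToyHeap` and `fragmented_example` are hypothetical names, the sizes are arbitrary units, and nothing here reflects how the CUDA allocator is actually implemented.

```cpp
#include <cstddef>
#include <map>

// Toy address space: maps the start offset of each live allocation to its size.
// Hypothetical model for illustration only, not the CUDA allocator.
struct ToyHeap {
    std::size_t capacity;
    std::map<std::size_t, std::size_t> live;  // offset -> size, ordered by offset

    explicit ToyHeap(std::size_t cap) : capacity(cap) {}

    // Sum of all unallocated units: the kind of number a "free memory" query reports.
    std::size_t total_free() const {
        std::size_t used = 0;
        for (const auto& a : live) used += a.second;
        return capacity - used;
    }

    // Largest gap between live allocations: the biggest single request that can succeed.
    std::size_t largest_free_block() const {
        std::size_t best = 0;
        std::size_t cursor = 0;  // end of the previous live allocation
        for (const auto& a : live) {
            if (a.first - cursor > best) best = a.first - cursor;
            cursor = a.first + a.second;
        }
        if (capacity - cursor > best) best = capacity - cursor;
        return best;
    }
};

// A heap of 8 units in which one small 1-unit allocation survives at offset 4.
ToyHeap fragmented_example() {
    ToyHeap heap(8);
    heap.live[4] = 1;
    return heap;
}
```

In this toy case `total_free()` is 7 units but `largest_free_block()` is only 4, so a single request for 5 units would fail even though most of the heap is free. That is the same shape of failure as described in the question.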

    It's probably not really a question of how you are freeing memory, but more a function of what overhead allocations remain after you free the memory. The fact that you are checking the mem info and getting a large number back says to me that you are freeing your allocations correctly.

    To work around this, one approach is to manage and reuse your allocations carefully. For instance, if you need a large 2.4G device vector of float, allocate it once and reuse it for successive operations. Also, if any other allocations remain on the device immediately before you re-allocate the 2.4G vector, try freeing those first (i.e. free all allocations you have made on the device).
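
The reuse suggestion could be sketched roughly as follows. This is an untested outline, not a drop-in fix: it assumes both data sets are `float` and that the result is no larger than the input (as in the question), the sizes are small stand-ins for the real 2.4G and 1.5G arrays, and `deviceBuf` is a hypothetical name. It needs to be compiled with nvcc.

```cpp
#include <algorithm>
#include <cstddef>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
    // Small stand-ins for the ~2.4G input and the ~1.5G intermediate result.
    thrust::host_vector<float> hostData(1 << 20, 1.0f);
    thrust::host_vector<float> hostResult(1 << 19, 0.0f);

    // Allocate the device buffer once, sized for the larger of the two data sets.
    const std::size_t n = std::max(hostData.size(), hostResult.size());
    thrust::device_vector<float> deviceBuf(n);

    // Phase 1: upload the input into the existing allocation.
    thrust::copy(hostData.begin(), hostData.end(), deviceBuf.begin());
    // ... compute on the first hostData.size() elements and fill hostResult ...

    // Phase 2: overwrite the same allocation with the intermediate results.
    thrust::copy(hostResult.begin(), hostResult.end(), deviceBuf.begin());
    // ... compute on the first hostResult.size() elements ...

    // Phase 3: reload the input; no new device allocation is requested here.
    thrust::copy(hostData.begin(), hostData.end(), deviceBuf.begin());
    return 0;
}
```

Because the large buffer is never released, the final reload does not depend on the allocator finding a fresh contiguous 2.4G region. The trade-off is that the full buffer stays resident during the second phase, which leaves less room for any other device allocations that phase needs.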