I am using the NVIDIA HPC SDK (2022) to compile the following code, whose basic purpose is to sum an N x M matrix into a vector of size N.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/fill.h>

constexpr unsigned int N = 2048, M = 2048;

int main(int argc, char* argv[]) {
    thrust::device_vector<double> g_vec1(N*M);
    thrust::device_vector<double> g_vec2(N);
    thrust::fill(thrust::device, g_vec1.begin(), g_vec1.end(), 1.);

    // one iterator per row, pointing at the start of that row
    thrust::device_vector<thrust::device_vector<double>::iterator> g_it_vec(N);
    for (int i = 0; i < N; i++)
        g_it_vec[i] = g_vec1.begin() + i*M;

    // reduce each row from inside the transform functor (device-side reduce)
    thrust::transform(g_it_vec.begin(), g_it_vec.end(), g_vec2.begin(),
        [] (const auto& it) {
            return thrust::reduce(thrust::device, it, it + M, 0.);
        });
}
When I run this code on a GeForce RTX 3080 Ti, an error occurs for M > 2048 doubles (or M > 1024 when I use complex doubles):
temporary_buffer::allocate: get_temporary_buffer failed
…
temporary_buffer::allocate: get_temporary_buffer failed
terminate called after throwing an instance of ‘thrust::system::system_error’
what(): transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Aborted (core dumped)
Why does this happen? Is it related to the maximum of 1024 threads per block?
Or is there any standard means to reduce a matrix (2d array) along the inner dimension?
thrust::reduce uses cub::DeviceReduce in the backend. The documentation says that

DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
When Thrust allocates scratch space for DeviceReduce with cudaMalloc in device code, it probably runs into size limitations:

cudaMalloc() and cudaFree() have distinct semantics between the host and device environments. [...] When invoked from the device runtime these functions map to device-side malloc() and free(). This implies that within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory.
This limit, cudaLimitMallocHeapSize, can be set with cudaDeviceSetLimit(cudaLimitMallocHeapSize, value), which should enable bigger reductions.
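As a minimal sketch (the 256 MiB value is just an assumption; the limit has to be raised before any device-side malloc() happens and should be sized to cover the scratch space of all device-side reductions in flight):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Assumption: 256 MiB of device heap is enough scratch space for all
    // device-side reductions in flight; tune this to the actual problem.
    const size_t heap_bytes = size_t{256} * 1024 * 1024;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    if (err != cudaSuccess)
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    // ... run the thrust::transform / thrust::reduce code from the question ...
}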
When calling Thrust algorithms on the host, one can avoid temporary allocations for scratch space by passing a custom allocator to the execution policy, as demonstrated in examples/cuda/custom_temporary_allocation.cu. I have never tried to do this in device code, so I can't guarantee that it is implemented there, but one could try to do the allocations on the host, wrap them in an allocator on the device and pass it to thrust::device. If this works, doing a single big allocation on the host should be much more efficient than doing many (N) small allocations on the device.
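For reference, this is roughly what the host-side version of that pattern looks like; the preallocated_scratch type is made up for this sketch and assumes at most one outstanding temporary allocation per algorithm call:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cstddef>
#include <new>
#include <cstdio>

// Hypothetical allocator that hands out a single preallocated chunk of
// device memory as scratch space instead of letting Thrust cudaMalloc it.
struct preallocated_scratch {
    typedef char value_type;

    char*       buffer;    // device pointer owned by the caller
    std::size_t capacity;  // size of that buffer in bytes

    preallocated_scratch(char* buf, std::size_t cap) : buffer(buf), capacity(cap) {}

    char* allocate(std::ptrdiff_t num_bytes) {
        if (static_cast<std::size_t>(num_bytes) > capacity)
            throw std::bad_alloc();
        return buffer;  // assumes only one live allocation at a time
    }

    void deallocate(char*, std::size_t) { /* buffer is reused, nothing to free */ }
};

int main() {
    thrust::device_vector<double> data(1 << 20, 1.);
    thrust::device_vector<char> scratch(16 << 20);  // 16 MiB of scratch space

    preallocated_scratch alloc(thrust::raw_pointer_cast(scratch.data()),
                               scratch.size());

    // Host-side call: Thrust takes its temporary storage from alloc
    // instead of calling cudaMalloc internally.
    double sum = thrust::reduce(thrust::cuda::par(alloc),
                                data.begin(), data.end(), 0.);
    std::printf("sum = %f\n", sum);
}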
You can find multiple alternative ways of implementing this reduction in my answers to How to do a reduction over one dimension of 2D data in Thrust and Parallelization of a for loop consisting of Thrust Transforms. In short, I would recommend using cub::DeviceSegmentedReduce (from the host), as sketched below.
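A minimal sketch of what that could look like for the matrix from the question (sizes are placeholders; the two-phase call, first querying the temporary-storage size and then doing the actual reduction, is how the CUB API is meant to be used):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

constexpr unsigned int N = 2048, M = 4096;  // M > 2048 is fine here

int main() {
    thrust::device_vector<double> in(std::size_t{N} * M, 1.);
    thrust::device_vector<double> out(N);

    // Row i covers [offsets[i], offsets[i+1]) in the flattened matrix.
    thrust::device_vector<int> offsets(N + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, static_cast<int>(M));

    const double* d_in  = thrust::raw_pointer_cast(in.data());
    double*       d_out = thrust::raw_pointer_cast(out.data());
    const int*    d_off = thrust::raw_pointer_cast(offsets.data());

    // First call: query how much temporary storage CUB needs.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes,
                                    d_in, d_out, N, d_off, d_off + 1);

    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());

    // Second call: actually reduce each row into out[i].
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes,
                                    d_in, d_out, N, d_off, d_off + 1);

    cudaDeviceSynchronize();
    std::printf("out[0] = %f (expected %u)\n", static_cast<double>(out[0]), M);
}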
I prefer avoiding CUDA Dynamic Parallelism (CDP) for this problem, not only because it is deprecated* in Thrust >= 1.15, but also because I only expect CDP to pay off for problems where the amount of parallelism is data-dependent**. The very regular problem of reducing one dimension of a dense matrix does not need dynamic parallelism, so its performance should not have to suffer from CDP overheads.
*: CUDA 12 introduced a new CDP API that disallows cudaDeviceSynchronize() in device code, which Thrust needs to keep usage consistent between host and device. The Thrust CDP deprecation notice says
A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP). This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launches a kernel; such calls will instead execute sequentially on the calling GPU thread instead of launching a device-wide kernel.
I.e. thrust::device will just become the same as thrust::seq when used in device code. This would probably solve the issue, since a sequential reduction does not need the additional scratch space, but performance would be bad due to the reduced amount of parallelism.
**: I.e. problems where the ideal number of threads is not known at kernel launch time on the host.