This issue reminds some typical many-body problem, but with some extra calculations.
I am working on the generalized Metropolis Monte-Carlo algorithm for the modeling of large number of arbitrary quantum systems (magnetic ions for example) interacting classically with each other. But it actually doesn't matter for the question.
There is more than 100000 interacting objects, each one can be described by a coordinate and a set of parameters describing its current state r_i
, s_i
.
Can be translated to the C++CUDA as
float4
andfloat4
vectors
To update the system following Monte-Carlo method for such systems, we need to randomly sample 1 object from the whole set; calculate the interaction function for it f(r_j - r_i, s_j)
; substitute to some matrix and find eigenvectors of it, from which one a new state will be calculated.
The interaction is additive as usual, i.e. the total interaction will be the sum between all pairs.
Formally this can be decomposed into steps
f(r_j - r_i, s_j)
F
h = h + dot(F,t)
. Some basic linear algebra stuff.V_k
and write in back to the array s_j
of all objects's states.There is a big question, which parts of this can be computed on CUDA kernels.
I am quite new to CUDA programming. So far I ended up with the following algorithm
//a good random generator
std::uniform_int_distribution<std::mt19937::result_type> random_sampler(0, N-1);
for(int i=0; i\<a_lot; ++i) {
//sample a number of object
nextObject = random_sampler(rng);
//call kernel to calculate the interaction and sum it up by threads. also to write down a new state back to the d_s array
CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);
//copy the sum
cudaMemcpy(buf, d_sum, sizeof(float)*4*blocksPerGrid, cudaMemcpyDeviceToHost);
//manually reduce the rest of the sum
total = buf[0];
for (int i=1; i<blocksPerGrid; ++i) {
total += buf[i];
}
//find eigenvalues and etc. and determine a new state of the object
//just linear algebra with complex numbers
newState = calcNewState(total);
//a new state will be written by CUDA function on the next iteration
//remember the previous number of the object
previousObject = nextObject;
}
The problem is continuous transferring data between CPU and GPU, and the actual number of bytes is blocksPerGrid*4*sizeof(float)
which sometimes is just a few bytes. I optimized CUDA code following the guide from NVIDIA and now it limited by the bus speed between CPU and GPU. I guess switching to pinned
memory type will not make any sense since the number of transferred bytes is low.
I used Nvidia Visual Profiler and it shows the following
the most time was waisted by the transferring the data to CPU. The speed as one can see by the inset is 57.143 MB/s
and the size is only 64B
!
The question is is it worth to move the logic of eigenvalues algorithm to CUDA kernel?
Therefore there will be no data transfer between CPU and GPU. The problem with this algorithm, you can update only one object per iteration. It means that I can run the eigensolver only on one CUDA core. ;( Will it be that slow compared to my CPU, that will eliminate the advantage of keeping data inside the GPU ram?
The matrix size for the eigensolver algorithm does not exceed 10x10 complex numbers. I've heard that cuBLAS can be run fully on CUDA kernels without calling the CPU functions, but not sure how it is implemented.
UPD-1
As it was mentioned in the comment section.
For the each iteration we need to diagonalize only one 10x10 complex Hermitian matrix, which depends on the total calculated interaction function f
. Then, we in general it is not allowed to a compute a new sum of f
, before we update the state of the sampled object based on eigenvectors and eigenvalues of 10x10 matrix.
Due to the stochastic nature of Monte-Carlo approach we need all 10 eigenvectors to pick up a new state for the sampled object.
However, the suggested idea of double-buffering (in the comments) can work out in a way if we calculate the total sum of f
for the next j-th
iteration without the contribution of i-th
sampled object and, then, add it later. I need to test it carefully in action...
UPD-2 The specs are
quite outdated, but I might find an access to the better system. However, switching to GTX1660 SUPER did not affect the performance, which means that a PCI bus is a bottleneck ;)
The question is is it worth to move the logic of eigenvalues algorithm to CUDA kernel?
Depends on the system. Old cpu + new gpu? Both new? Both old?
Generally single cuda thread is a lot slower than single cpu thread. Because cuda compiler does not vectorize its loops but host c++ compiler vectorizes. So, you need to use 10-100 cuda threads to make the comparison fair.
For the optimizations:
According to the image, currently it loses 1 microsecond as a serial part of overall algorithm. 1 microsecond is not much compared to the usual kernel-launch latency from CPU but is big when it is GPU launching the kernel (dynamic parallelism) itself.
CUDA-graph feature enables the overall algorithm re-launch every step(kernel) automatically and complete quicker if steps are not CPU-dependent. But it is intended for "graph"-like workloads where some kernel leads to multiple kernels and they later join in another kernel, etc.
CUDA-dynamic-parallelism feature lets a kernel's cuda threads launch new kernels. This has much better timings than launching from CPU due to not waiting for the synchronizations between driver and host.
Sampling part's copying could be made in chunks like 100-1000 elements at once and consumed by CUDA part at once for 100-1000 steps if all parts are in CUDA.
If I were to write it, I would do it like this:
launch a loop kernel (only 1 CUDA thread) that is parent
start loop in the kernel
do real (child) kernel-launching within the loop
since every iteration needs serial, it should sync before continuing next iteration.
end the parent after 100-1000 sized chunk is complete and get new random data from CPU
when parent kernel ends, it shows in profiler as a single kernel launch that takes a lot of time and it doesn't have any CPU-based inefficiencies.
On top of the time saved from not synching a lot, there would be consistency of performance between 10x10 matrix part and the other kernel part because they are always in same hardware, not some different CPU and GPU.
Since random-num generation is always an input for the system, at least it can be double-buffered to hide cpu-to-gpu data copying latency behind the computation. Iirc, random number generation is much cheaper than sending data over pcie bridge. So this would hide mostly the data transmission slowness.
If it is a massively parallel experiment like running the executable N times, you can still launch like 10 executable instances at once and let them keep gpu busy with good efficiency. Not practical if too much memory is required per instance. Many gpus except ancient ones can run tens of kernels in parallel if each of them can not fully occupy all resources of gpu.