im working on a fluid simulator using opengl (implementing the sph algorithm). I've tried many methods to run my simulator, first i used octrees, after that hashmaps and now i am trying to use Z order, and for that i need to order my particles based on their index.
What i am having some trouble understanding is the fact that if i have one thrust::sort it takes 15 miliseconds, if i have two thrust::sort it takes 17 miliseconds.
For more clarification, i am doing my simulator in opengl (all my buffers are created using opengl), and i use cuda interops in order to sort my buffers with thrust, that uses cuda.
This is the part where i get my buffers and "link" them to cuda
//I use this if to do the registerBuffer only one time
if (first == 0) {
//index
IBuffer* bIndex = RESOURCEMANAGER->getBuffer("particleLib::Index");
int buffIdIndex = bIndex->getPropi(IBuffer::ID);
//Position
IBuffer* bPosition = RESOURCEMANAGER->getBuffer("particleLib::Position");
int buffIdPosition = bPosition->getPropi(IBuffer::ID);
//TempIndex
IBuffer* bTempIndex = RESOURCEMANAGER->getBuffer("particleLib::TempIndex");
int buffIdTempIndex = bTempIndex->getPropi(IBuffer::ID);
//Velocity
IBuffer* bVelocity = RESOURCEMANAGER->getBuffer("particleLib::Velocity");
int buffIdVelocity = bVelocity->getPropi(IBuffer::ID);
// register this buffer object with CUDA
//So devia chamar isto uma vez
cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Index, buffIdIndex, cudaGraphicsMapFlagsNone);
cudaGraphicsGLRegisterBuffer(&cuda_ssbo_TempIndex, buffIdTempIndex, cudaGraphicsMapFlagsNone);
cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Position, buffIdPosition, cudaGraphicsMapFlagsNone);
cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Velocity, buffIdVelocity, cudaGraphicsMapFlagsNone);
first = 1;
}
// map OpenGL buffer object for writing from CUDA
int* dptrssboIndex;
int* dptrssboTempIndex;
float4 * dptrssboPosition;
float4 * dptrssboVelocity;
cudaGraphicsMapResources(1, &cuda_ssbo_Index, 0);
cudaGraphicsMapResources(1, &cuda_ssbo_TempIndex, 0);
cudaGraphicsMapResources(1, &cuda_ssbo_Position, 0);
cudaGraphicsMapResources(1, &cuda_ssbo_Velocity, 0);
size_t num_bytesssbo_Index;
size_t num_bytesssbo_TempIndex;
size_t num_bytesssbo_Position;
size_t num_bytesssbo_Velocity;
cudaGraphicsResourceGetMappedPointer((void**)&dptrssboIndex, &num_bytesssbo_Index, cuda_ssbo_Index);
cudaGraphicsResourceGetMappedPointer((void**)&dptrssboTempIndex, &num_bytesssbo_TempIndex, cuda_ssbo_TempIndex);
cudaGraphicsResourceGetMappedPointer((void**)&dptrssboPosition, &num_bytesssbo_Position, cuda_ssbo_Position);
cudaGraphicsResourceGetMappedPointer((void**)&dptrssboVelocity, &num_bytesssbo_Velocity, cuda_ssbo_Velocity);
mysort(&dptrssboIndex,&dptrssboPosition, &dptrssboTempIndex, &dptrssboVelocity,216000);
cudaGraphicsUnmapResources(1, &cuda_ssbo_Index, 0);
cudaGraphicsUnmapResources(1, &cuda_ssbo_TempIndex, 0);
cudaGraphicsUnmapResources(1, &cuda_ssbo_Position, 0);
cudaGraphicsUnmapResources(1, &cuda_ssbo_Velocity, 0);
This is the code from mysort
void mysort(int ** index1, float4 ** values1, int** index2, float4 ** values2,int particles){
thrust::device_ptr<int> i1buff = thrust::device_pointer_cast(*(index1));
thrust::device_ptr<float4> v1buff = thrust::device_pointer_cast(*(values1));
thrust::device_ptr<int> i2buff = thrust::device_pointer_cast(*(index2));
thrust::device_ptr<float4> v2buff = thrust::device_pointer_cast(*(values2));
//sorts
thrust::stable_sort_by_key(i1buff, i1buff + particles,v1buff); // 15 ms
//cudaThreadSynchronize();
thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); // 17 ms
//repetido so para ver o tempo
thrust::stable_sort_by_key(i1buff, i1buff + particles, v1buff);
//cudaThreadSynchronize();
thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); //4 sorts -> 19 ms
//cudaThreadSynchronize();
}
Can some one explain what is going on?
Edit1: I used cudaDeviceSynchronize() to mesure the time (as indicated by @Jérôme-Richard) it takes for each sort, and the first sort always takes longer, even if i change orders. Another fact is that the first sort takes longer if i have my camera closer to the scene, this indicates that maybe Cuda is wating for opengl to do his work, making the first sort "take longer". I also tried having no sorts on my mysort() function, the only thing i had inside was the cudaDeviceSynchronize() and it took the 15 miliseconds, again, this indicates that cuda might be wating on opengl to finish the work from the last frame.
Edit2: I did some more debugging and what i thought appears to be true. The real slow down comes from the cudaGraphicsMapResources calls. According to this (cudaGraphicsMapResources slow speed when mapping DirectX texture):
This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.
So yeah, It is waiting for opengl to draw some stuff, because the camara distance affects the time cudaGraphicsMapResources takes.
Two important points could explain you observation:
cudaDeviceSynchronize
) in many provided functions in order to ensure returned data transferred from the GPU can be safely read from the CPU side. It also internally use such kind of synchronization when multiple interdependent CUDA kernels are submitted regarding the result of the computed data (you can see that with the Nvidia profiler). Regarding the previous asynchronous CUDA calls made before this function, the over-synchronization can add an unwanted overhead.