Tags: c++, opengl, cuda, thrust

First execution of thrust::sort takes too long


I'm working on a fluid simulator in OpenGL (implementing the SPH algorithm). I've tried several neighbor-search methods: first octrees, then hash maps, and now Z-order indexing, which requires sorting my particles by their index.

What I'm having trouble understanding is that a single thrust::sort takes 15 milliseconds, but two thrust::sort calls take only 17 milliseconds.

For clarification: my simulator runs on OpenGL (all my buffers are created with OpenGL), and I use CUDA interop to sort those buffers with Thrust, which uses CUDA.

This is the part where I get my buffers and map them into CUDA:

//Use this flag so the buffers are registered with CUDA only once
if (first == 0) {
        //index
        IBuffer* bIndex = RESOURCEMANAGER->getBuffer("particleLib::Index");
        int buffIdIndex = bIndex->getPropi(IBuffer::ID);
        //Position
        IBuffer* bPosition = RESOURCEMANAGER->getBuffer("particleLib::Position");
        int buffIdPosition = bPosition->getPropi(IBuffer::ID);
        //TempIndex
        IBuffer* bTempIndex = RESOURCEMANAGER->getBuffer("particleLib::TempIndex");
        int buffIdTempIndex = bTempIndex->getPropi(IBuffer::ID);
        //Velocity
        IBuffer* bVelocity = RESOURCEMANAGER->getBuffer("particleLib::Velocity");
        int buffIdVelocity = bVelocity->getPropi(IBuffer::ID);

        // register these buffer objects with CUDA
        // (this should be called only once)
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Index, buffIdIndex, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_TempIndex, buffIdTempIndex, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Position, buffIdPosition, cudaGraphicsMapFlagsNone);
        cudaGraphicsGLRegisterBuffer(&cuda_ssbo_Velocity, buffIdVelocity, cudaGraphicsMapFlagsNone);
        first = 1;
    }
    

    // map OpenGL buffer object for writing from CUDA
    int* dptrssboIndex;
    int* dptrssboTempIndex;
    float4 * dptrssboPosition;
    float4 * dptrssboVelocity;

    cudaGraphicsMapResources(1, &cuda_ssbo_Index, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_TempIndex, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_Position, 0);
    cudaGraphicsMapResources(1, &cuda_ssbo_Velocity, 0);

    size_t num_bytesssbo_Index;
    size_t num_bytesssbo_TempIndex;
    size_t num_bytesssbo_Position;
    size_t num_bytesssbo_Velocity;

    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboIndex, &num_bytesssbo_Index, cuda_ssbo_Index);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboTempIndex, &num_bytesssbo_TempIndex, cuda_ssbo_TempIndex);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboPosition, &num_bytesssbo_Position, cuda_ssbo_Position);
    cudaGraphicsResourceGetMappedPointer((void**)&dptrssboVelocity, &num_bytesssbo_Velocity, cuda_ssbo_Velocity);

    mysort(&dptrssboIndex,&dptrssboPosition, &dptrssboTempIndex, &dptrssboVelocity,216000);

    cudaGraphicsUnmapResources(1, &cuda_ssbo_Index, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_TempIndex, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_Position, 0);
    cudaGraphicsUnmapResources(1, &cuda_ssbo_Velocity, 0);
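As a side note, cudaGraphicsMapResources takes a count and an array of resources, so the four map/unmap calls above could be batched into one each. A sketch, reusing the same cuda_ssbo_* handles from above:

```cuda
// Hypothetical refactor of the mapping step above: one map/unmap call
// covering all four resources instead of four separate calls.
cudaGraphicsResource_t resources[4] = {
    cuda_ssbo_Index, cuda_ssbo_TempIndex,
    cuda_ssbo_Position, cuda_ssbo_Velocity
};

// Maps all resources for access by CUDA; this call also synchronizes
// with any pending OpenGL work that touches these buffers.
cudaGraphicsMapResources(4, resources, 0);

// ... cudaGraphicsResourceGetMappedPointer + mysort as before ...

cudaGraphicsUnmapResources(4, resources, 0);
```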

This is the code for mysort:

void mysort(int ** index1, float4 ** values1, int** index2, float4 ** values2,int particles){
    
    thrust::device_ptr<int> i1buff = thrust::device_pointer_cast(*(index1));
    thrust::device_ptr<float4> v1buff = thrust::device_pointer_cast(*(values1));
    thrust::device_ptr<int> i2buff = thrust::device_pointer_cast(*(index2));
    thrust::device_ptr<float4> v2buff = thrust::device_pointer_cast(*(values2));

    //sorts
    thrust::stable_sort_by_key(i1buff, i1buff + particles,v1buff); // 15 ms
    //cudaThreadSynchronize();
    thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); // 17 ms


    //repeated only to check the timing
    thrust::stable_sort_by_key(i1buff, i1buff + particles, v1buff);
    //cudaThreadSynchronize();
    thrust::stable_sort_by_key(i2buff, i2buff + particles, v2buff); //4 sorts -> 19 ms

    //cudaThreadSynchronize();
}

Can someone explain what is going on?

Edit1: I used cudaDeviceSynchronize() to measure the time each sort takes (as suggested by @Jérôme-Richard), and the first sort always takes longer, even if I swap their order. Another observation: the first sort takes longer when my camera is closer to the scene, which suggests that CUDA is waiting for OpenGL to finish its work, making the first sort appear slower. I also tried a mysort() with no sorts at all, containing only cudaDeviceSynchronize(), and it still took the 15 milliseconds; again, this indicates that CUDA may be waiting for OpenGL to finish the previous frame's work.
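For reference, a way to time one sort in isolation is to drain previously queued GPU work first and then use CUDA events, so earlier pending OpenGL/CUDA work is not counted. A sketch (variable names match mysort above):

```cuda
// Time a single sort with CUDA events, excluding previously queued work.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaDeviceSynchronize();               // drain anything already queued
cudaEventRecord(start, 0);
thrust::stable_sort_by_key(i1buff, i1buff + particles, v1buff);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time of the sort alone
cudaEventDestroy(start);
cudaEventDestroy(stop);
```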

Edit2: I did some more debugging, and what I suspected appears to be true. The real slowdown comes from the cudaGraphicsMapResources calls. According to this answer (cudaGraphicsMapResources slow speed when mapping DirectX texture):

This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.

So yes, it is waiting for OpenGL to finish drawing, because the camera distance affects how long cudaGraphicsMapResources takes.


Solution

  • A few important points could explain your observation:

    • The first CUDA function call implicitly initializes the runtime, which is quite slow.
    • The actual content of the arrays being sorted can, and often does, affect the performance of a sort (depending on the algorithm used in the Thrust implementation). Once the data are sorted, subsequent sorts run faster because the input is already ordered.
    • Thrust performs a few synchronizations (i.e., it calls cudaDeviceSynchronize) in many of its functions, to ensure that data transferred back from the GPU can be safely read on the CPU side. It also synchronizes internally when multiple interdependent CUDA kernels are submitted (you can see this with the Nvidia profiler). Given the asynchronous CUDA calls issued beforehand, this over-synchronization can add unwanted overhead.
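In practice, the runtime-initialization cost from the first point can be paid up front with a warm-up at application startup, before any timed frame. A sketch of a common idiom (the function name is illustrative; cudaFree(0) simply forces context creation):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Call once at startup, before any timed work.
void warmUpCuda() {
    cudaFree(0);  // forces lazy CUDA context initialization now

    // Throwaway sort on scratch data so Thrust's kernels are loaded
    // before the first timed frame.
    thrust::device_vector<int> scratch(1024, 1);
    thrust::sort(scratch.begin(), scratch.end());
    cudaDeviceSynchronize();
}
```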
    • Thrust make few synchronizations (ie. it calls cudaDeviceSynchronize) in many provided functions in order to ensure returned data transferred from the GPU can be safely read from the CPU side. It also internally use such kind of synchronization when multiple interdependent CUDA kernels are submitted regarding the result of the computed data (you can see that with the Nvidia profiler). Regarding the previous asynchronous CUDA calls made before this function, the over-synchronization can add an unwanted overhead.