Can I be sure that in this example the atomic operations will be performed in numerical order of threads? If not, how can I get that ordering without restricting the work to a single thread?
__shared__ unsigned int cnt[MAXLEN], s[MAXLEN];
#pragma unroll
for (int i = 0; i < MAXLEN; i += blockDim.x)
    p[atomicSub(cnt + s[threadIdx.x + i], 1) - 1] = threadIdx.x + i;
__syncthreads();
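If by "numerical order of threads" you mean ordering by thread ID, or the order in which the threads were started, the answer is no. Those factors play no reliable part in the ordering of events in threads. Each atomicSub is itself indivisible, so every thread gets a distinct slot, but which thread gets which slot can change from run to run.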
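If what you need is a reproducible result while still using every thread, one option is to drop the atomic and let each element compute its own slot. Each thread counts how many earlier elements share its key and subtracts that rank from the counter value, which gives the slot the element would have received had the atomicSub calls run in index order. The following is only a sketch under assumptions that are mine, not from the question: `MAXLEN` is 256, the block has 128 threads, every key is below `MAXLEN`, `cnt` holds the inclusive prefix sum of the key histogram, and the names `orderedScatter` and `runOrderedScatter` are hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int MAXLEN = 256;  // assumed size; the question does not give it

// Launch with a single block. cnt is only read, so no atomics are needed.
__global__ void orderedScatter(const unsigned int* sIn,
                               const unsigned int* cntIn,
                               unsigned int* p)
{
    __shared__ unsigned int cnt[MAXLEN], s[MAXLEN];
    for (int i = threadIdx.x; i < MAXLEN; i += blockDim.x) {
        s[i] = sIn[i];
        cnt[i] = cntIn[i];
    }
    __syncthreads();

    for (int i = threadIdx.x; i < MAXLEN; i += blockDim.x) {
        unsigned int key = s[i];
        // rank = number of earlier elements with the same key
        unsigned int rank = 0;
        for (int j = 0; j < i; ++j)
            rank += (s[j] == key);
        // Slot this element would get if atomicSub ran in index order.
        p[cnt[key] - 1 - rank] = i;
    }
}

// Host helper (hypothetical): builds cnt, runs the kernel, returns p.
std::vector<unsigned int> runOrderedScatter(const std::vector<unsigned int>& keys)
{
    assert(keys.size() == MAXLEN);
    std::vector<unsigned int> cnt(MAXLEN, 0);
    for (unsigned int k : keys) ++cnt[k];                       // histogram
    for (int k = 1; k < MAXLEN; ++k) cnt[k] += cnt[k - 1];      // inclusive scan

    const size_t bytes = MAXLEN * sizeof(unsigned int);
    unsigned int *dS = nullptr, *dCnt = nullptr, *dP = nullptr;
    cudaMalloc(&dS, bytes);
    cudaMalloc(&dCnt, bytes);
    cudaMalloc(&dP, bytes);
    cudaMemcpy(dS, keys.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dCnt, cnt.data(), bytes, cudaMemcpyHostToDevice);

    orderedScatter<<<1, 128>>>(dS, dCnt, dP);

    std::vector<unsigned int> p(MAXLEN);
    cudaMemcpy(p.data(), dP, bytes, cudaMemcpyDeviceToHost);    // also syncs
    cudaFree(dS);
    cudaFree(dCnt);
    cudaFree(dP);
    return p;
}

int main()
{
    std::vector<unsigned int> keys(MAXLEN);
    for (int i = 0; i < MAXLEN; ++i) keys[i] = (i * 7) % 5;

    std::vector<unsigned int> p = runOrderedScatter(keys);

    // Keys must be non-decreasing along p; within one key, lower indices
    // sit in higher slots, as index-ordered atomicSub calls would place them.
    bool ok = true;
    for (int a = 0; a + 1 < MAXLEN; ++a) {
        unsigned int ka = keys[p[a]], kb = keys[p[a + 1]];
        if (ka > kb || (ka == kb && p[a] < p[a + 1])) ok = false;
    }
    std::printf("%s\n", ok ? "order is deterministic" : "unexpected order");
    return ok ? 0 : 1;
}
```

The inner loop makes this quadratic in `MAXLEN`, which is usually acceptable for a few hundred elements in shared memory. For larger inputs a per-key prefix sum would likely be the better way to get the same ranks.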