Tags: cuda, 64-bit, idioms, gpu-shared-memory, bank-conflict

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory


Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:

for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array(); // index into this thread's own N elements
    T value = calculate_value();                                   // the computation itself is irrelevant here
    write_to_own_shmem(thread_idx, value);                         // store into this thread's region of shared memory
}

Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.

If sizeof(T) == 4, then this is easy-peasy: Just place all of thread i's data at shared-memory word indices i, 32+i, 64+i, 96+i, etc. This puts all of thread i's data in its own bank, which is also distinct from the other lanes' banks. Great.
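For concreteness, here is a minimal sketch of that 32-bit layout; the kernel name, the use of unsigned as T, and the stand-in value computation are illustrative assumptions, not part of the question:

__global__ void per_lane_32bit(int N)    // hypothetical kernel; launch with 32 * N * 4 bytes of dynamic shared memory
{
    extern __shared__ unsigned shmem[];  // 32 * N elements of a 4-byte type
    unsigned lane = threadIdx.x % 32;

    for (int idx = 0; idx < N; idx++) {
        unsigned value = lane * 1000 + idx;  // stand-in for calculate_value()
        shmem[idx * 32 + lane] = value;      // element idx of lane i falls in bank i: no conflicts
    }
    __syncwarp();
}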

But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?

Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.


Solution

  • tl;dr: Use the same kind of interleaving as for 32-bit values.

    On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).

    This is achievable in practice with the placement pattern analogous to the one the OP described for 32-bit data. That is, for T* arr, have lane i read its idx'th element as arr[idx * 32 + i] (see the sketch after the list below). This will compile so that two transactions occur:

    1. The lower 16 lanes obtain their data from the first 32*4 bytes of the warp's access (utilizing all banks)
    2. The higher 16 lanes obtain theirs from the next 32*4 bytes (again utilizing all banks)
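    A minimal sketch of this 64-bit placement follows; the kernel name, the choice of double for T, and the value computation are illustrative assumptions, not code from the answer:

    __global__ void per_lane_64bit(int N)   // hypothetical kernel; launch with 32 * N * 8 bytes of dynamic shared memory
    {
        extern __shared__ double arr[];     // 32 * N elements of an 8-byte type
        unsigned lane = threadIdx.x % 32;

        for (int idx = 0; idx < N; idx++) {
            double value = lane + 0.5 * idx;  // stand-in for calculate_value()
            arr[idx * 32 + lane] = value;     // same interleaving as the 32-bit case
        }
        __syncwarp();
        // A full-warp access to one idx spans 256 contiguous bytes; the hardware
        // serves it as two 128-byte transactions (lanes 0-15, then lanes 16-31),
        // each hitting all 32 banks exactly once.
    }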

    So the GPU is smarter/more flexible than fetching 4 bytes for each lane separately, and it can do better than the simplistic "break up T into halves" idea proposed in the earlier answer.

    (This answer is based on @RobertCrovella's comments.)