I'm wondering if there is any inherent advantage to having a kernel be > 1D, besides the convenience of abstraction. I figure that if the dimensionality of the kernel is relevant, the answer might have to do with the layout of the GPU. I would generally prefer to stick to 1D and flatten higher-dimensional data. Is there anything wrong with this approach on a technical level?
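For concreteness, here's roughly what I mean by the 1D-and-flatten approach (the kernel name, array name, and sizes are just placeholders):

__global__ void scale_flattened(float* data, size_t n, float factor)
{
    // One flat index over the whole (logically 2D or 3D) array.
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Launched with a plain 1D grid, e.g. for a rows x cols image stored as rows*cols floats:
// scale_flattened<<<(rows * cols + 255) / 256, 256>>>(d_image, (size_t)rows * cols, 2.0f);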
It's even worse than what Jerome said... flattening a 2D/3D location into a single index is not so cheap. Just think about it:
flattened_block_id = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y;
flattened_thread_id = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
block_volume = blockDim.x * blockDim.y * blockDim.z;
global_flattened_id = flattened_thread_id + block_volume * flattened_block_id;
And this is all while ignoring the dimension bounds, which you generally can't do: once the flattened index can exceed the 32-bit range, you need widening conversions, and half of the multiplications and additions become 64-bit operations. That's a lot of operations! And just think about those cases where you have a condition such as:
if (is_nice(global_flattened_id)) { return; }
With that in place, you've just made sure you have to pay for all of those operations even when the thread isn't going to do anything.
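As a rough sketch of how that plays out in a full kernel (not from the original answer; the kernel name is a placeholder and is_nice() is a dummy stand-in so it compiles), with the index math widened to 64-bit and the early exit sitting after all of it:

// Dummy stand-in for whatever early-exit predicate the kernel actually uses.
__device__ bool is_nice(size_t id) { return (id & 1) == 0; }

__global__ void example_kernel(float* data)
{
    // Widened to 64-bit because the flattened block index can exceed the 32-bit range.
    size_t flattened_block_id = blockIdx.x
        + (size_t)blockIdx.y * gridDim.x
        + (size_t)blockIdx.z * gridDim.x * gridDim.y;
    // The thread id within a block fits in 32 bits (at most 1024 threads per block).
    unsigned flattened_thread_id = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
    unsigned block_volume = blockDim.x * blockDim.y * blockDim.z;
    size_t global_flattened_id = flattened_thread_id + block_volume * flattened_block_id;

    // Every thread has already paid for the arithmetic above by the time it gets here.
    if (is_nice(global_flattened_id)) { return; }

    data[global_flattened_id] *= 2.0f;   // placeholder for the real work
}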
Having said that... when I was writing kernels that work on 1D data, I also had the notion that these extra dimensions are just silly. Then I started actually having 3D (or 5D) data and they got useful very quickly :-)
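For example, once the data really is 3D, a kernel can lean on the built-in dimensions directly, give each axis its own bounds check, and only flatten once, at the memory access. A minimal sketch (the names volume, nx/ny/nz are placeholders):

__global__ void scale_volume(float* volume, unsigned nx, unsigned ny, unsigned nz, float factor)
{
    // One coordinate per built-in dimension; no manual decomposition of a flat id.
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) { return; }

    // Flatten only for the actual memory access (row-major, x fastest).
    size_t idx = ((size_t)z * ny + y) * nx + x;
    volume[idx] *= factor;
}

// Launch, e.g.:
// dim3 block(8, 8, 8);
// dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
// scale_volume<<<grid, block>>>(d_volume, nx, ny, nz, 2.0f);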
Finally, remember: CUDA is a byproduct of the evolution of 3D graphics shaders. It needed 3D representations before it needed you or me as users...