Tags: gpu, gpgpu, gpu-warp

How does a GPU group threads into warps/wavefronts?


My understanding is that a warp is a group of threads defined at runtime by the task scheduler. One performance-critical aspect of CUDA is thread divergence within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block?

For instance, if I start a kernel with 1024 threads in a thread block, how will the warps be arranged? Can I tell that (or at least make a good guess) from the thread index?

Knowing this, one could minimize the divergence of threads within a given warp.
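
To make the scenario concrete, here is a minimal sketch of the kind of launch I mean (my_kernel is just a placeholder name):

    #include <cuda_runtime.h>

    __global__ void my_kernel()
    {
        // Each thread knows its own index; the question is how these
        // indices are grouped into warps.
        int tid = threadIdx.x;
        (void)tid;  // placeholder body
    }

    int main()
    {
        my_kernel<<<1, 1024>>>();  // one block of 1024 threads
        cudaDeviceSynchronize();
        return 0;
    }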


Solution

  • The thread arrangement inside the warp is implementation dependent, but so far I have always observed the same behavior:

    A warp is composed of 32 threads, but the warp scheduler issues 1 instruction for half a warp at a time (16 threads).

    • If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler issues 1 instruction for threadIdx.x = (0..15), (16..31), ... etc.

    • If you use 2D blocks (both the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler tries to issue in the following fashion:

    threadIdx.y = 0, threadIdx.x = (0..15) (16..31) ... etc.

    So, threads with consecutive threadIdx.x components execute the same instruction in groups of 16. A short test kernel to verify this mapping is sketched below.
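
You can check the mapping yourself with a small kernel. This is a minimal sketch of my own (show_warp_mapping is a made-up name, and the exact grouping is only what I have observed, not a documented guarantee): it computes each thread's linear index within the block, derives the warp ID from it, and prints one line per warp.

    #include <cstdio>

    __global__ void show_warp_mapping()
    {
        // Linear thread index within the block: x varies fastest, then y, then z.
        int linear = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;

        int warp_id = linear / warpSize;  // warpSize is 32 on current NVIDIA GPUs
        int lane    = linear % warpSize;  // this thread's position inside its warp

        // Print from lane 0 only, so we get one line per warp.
        if (lane == 0)
            printf("thread (%d,%d) -> linear %d -> warp %d\n",
                   threadIdx.x, threadIdx.y, linear, warp_id);
    }

    int main()
    {
        // A 16x4 block: with the mapping above, each 32-thread warp
        // covers two consecutive rows of 16 threads.
        show_warp_mapping<<<1, dim3(16, 4)>>>();
        cudaDeviceSynchronize();
        return 0;
    }

If consecutive threadIdx.x values really are grouped together, the output shows warp 0 starting at (0,0) and warp 1 starting at (0,2), i.e. warps are built from the linearized index with x as the fastest-varying dimension.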