
Relation between warp scheduling and warp context switching in CUDA


As far as I understand, a ready warp is one that can be selected for execution by warp scheduling. A waiting warp is waiting for its source operands to be fetched or computed, so it cannot be executed yet. The warp scheduler chooses a ready warp to execute; this is "warp scheduling".

On the other hand, when a warp hits a pipeline stall or a long global memory latency, another warp is brought into execution to hide the latency. This is the basic idea of "warp context switching" in CUDA.

My question is: what is the relation between warp scheduling and warp context switching in CUDA? To elaborate, here is an example.

E.g. warp A is stalled, waiting for a value to be fetched from global memory. Once the value arrives, warp A is scheduled (or switched) back into the ready warp pool. Based on this, warp context switching is a part of warp scheduling. Is that correct?

Can anyone also provide references on warp context switching and warp scheduling in CUDA? It seems NVIDIA does not make these documents publicly available.

Thanks in advance for any reply.


Solution

  • The ready warps are those which can be scheduled on the next cycle. Stalled warps cannot be scheduled.

    To answer the question about latency with an extremely simplified example, suppose that the latency to main memory is 8 execution cycles, and let's ignore the fact that the machine is pipelined. Let's assume all instructions can execute in one cycle, if the data is ready.

    Now suppose I have C code like this:

    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    
    int myval = global_data[idx]*global_data[idx];
    

    That is, when the code has run, myval should contain the square of an item in global memory. The compiler will decompose this into a sequence of assembly language instructions. Let's suppose they look something like this:

    I0: R0 = global_data[idx];
    I1: R1 = R0 * R0;
    I2: ...
    

    Every thread can execute the first instruction (initially there are no stalls); there is no dependency yet, and issuing a read by itself does not cause a stall. However, when a warp then moves on to the second instruction, the value in R0 must be correct, so a stall occurs while the read is retrieved. As mentioned already, suppose the latency is 8 cycles. Using a warp width of 32 and a threadblock size of 512, we have a total of 16 warps (numbered 0-15 below). Let's suppose for simplicity we have a Fermi SM with only 32 execution units, so it issues one warp per cycle. The sequence will look something like this:

    cycle:     ready warps:   executing warp:     instruction executed:     Latency:
        0            1-15                   0            I0 -> I1 (stall)    --
        1            2-15                   1            I0 -> I1 (stall)     | --
        2            3-15                   2            I0 -> I1 (stall)     |  |
        3            4-15                   3            I0 -> I1 (stall)     |  |
        4            5-15                   4            I0 -> I1 (stall)     |  |
        5            6-15                   5            I0 -> I1 (stall)     |  |
        6            7-15                   6            I0 -> I1 (stall)     |  |
        7            8-15                   7            I0 -> I1 (stall)     |  |
        8          0,9-15                   8            I0 -> I1 (stall)    <-  |
        9          1,9-15                   0            I1 -> I2            <----

    What we see is that while the latency is being covered by executing instructions from the other warps, a previously "stalled" warp re-enters the ready warp pool, and the scheduler can schedule that warp again (i.e. issue the multiply contained in I1) on the very next cycle after the stall condition is removed. (In the Latency column, the left track marks warp 0's read, started at cycle 0 and completed at cycle 8; the right track marks warp 1's read, started at cycle 1 and completed at cycle 9.)
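
    To make that round-robin picture concrete, below is a small host-side toy model. It is my own sketch, not anything from the CUDA toolkit or from the question: 16 warps, an assumed 8-cycle load latency, one instruction issued per cycle, and a scheduler that always picks the lowest-numbered ready warp. It models the concept only, not how the real hardware scheduler is implemented.

    #include <stdio.h>

    #define NWARPS  16   /* 512 threads / 32 threads per warp */
    #define LATENCY 8    /* assumed global-memory latency, in cycles */

    int main(void)
    {
        int next_instr[NWARPS] = {0};  /* 0 = I0 not issued yet, 1 = waiting to issue I1, 2 = done */
        int ready_at[NWARPS]   = {0};  /* first cycle at which the warp may issue again            */

        for (int cycle = 0; cycle < 12; ++cycle) {
            int w = -1;
            for (int i = 0; i < NWARPS; ++i)       /* pick the lowest-numbered ready warp */
                if (next_instr[i] < 2 && ready_at[i] <= cycle) { w = i; break; }
            if (w < 0) { printf("cycle %2d: all warps stalled\n", cycle); continue; }

            if (next_instr[w] == 0) {
                /* Issue I0 (the load); the warp then stalls at I1 until the data returns.
                   The result is assumed usable on the cycle after the latency elapses,
                   which reproduces warp 0 re-issuing at cycle 9 as in the table above. */
                ready_at[w] = cycle + LATENCY + 1;
                printf("cycle %2d: warp %2d issues I0, stalls at I1 until cycle %d\n",
                       cycle, w, ready_at[w]);
            } else {
                /* Issue I1 (the multiply); the loaded value is now available. */
                printf("cycle %2d: warp %2d issues I1\n", cycle, w);
            }
            ++next_instr[w];
        }
        return 0;
    }

    Running it prints warps 0 through 8 issuing I0 on cycles 0-8, then warp 0 issuing I1 on cycle 9, warp 1 on cycle 10, and so on, matching the sequence in the table.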

    There is no contradiction between latency hiding and warp scheduling. They work together: for a code with sufficient work to do, warp scheduling hides the latency associated with various operations, such as reading from global memory.

    The above example is a simplification of the actual behavior, but it adequately represents the concepts of latency hiding and warp scheduling: in the presence of "enough work to do", the warp scheduler can keep the machine busy and hide latency.
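
    For completeness, here is a sketch of a full, compilable version of the snippet at the top of this answer. The kernel name square_kernel, the array size N, and the output array are hypothetical additions of mine, not part of the original code. Launching many 512-thread blocks gives each SM a pool of warps, which is exactly what lets the scheduler hide the global-memory read latency as described above. If you want to see the real counterparts of I0 and I1, you can disassemble the compiled binary with cuobjdump -sass.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void square_kernel(const int *global_data, int *out, int n)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx < n) {
            int myval = global_data[idx] * global_data[idx];  /* I0 (load), then I1 (multiply) */
            out[idx] = myval;                                 /* store, so the result is observable */
        }
    }

    int main()
    {
        const int N       = 1 << 20;            /* 1M elements: many warps per SM   */
        const int threads = 512;                /* 512 threads -> 16 warps per block */
        const int blocks  = (N + threads - 1) / threads;

        int *d_in = NULL, *d_out = NULL;
        cudaMalloc(&d_in,  N * sizeof(int));
        cudaMalloc(&d_out, N * sizeof(int));
        /* initialization of d_in omitted; any data serves for this illustration */

        square_kernel<<<blocks, threads>>>(d_in, d_out, N);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("CUDA error: %s\n", cudaGetErrorString(err));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }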