simd gpu wavefront

Number of wavefronts that can fit on a SIMD


I'm reading an article about an AMD GPU and am confused by a particular example. Given a SIMD unit with a number of registers, how many wavefronts can occupy a SIMD if they require x amount of registers?

Specifically, a SIMD unit has 16k registers to share between 1-32 wavefronts. This implies that each wavefront can have an average of 8 registers per thread (if there are 32 wavefronts). This is fine.

It then goes on to say that there is a global limit to the number of wavefronts on the SIMD of ~20.6 which would then give each wavefront 11-12 registers.

This next part confuses me. It goes on to say that only 2 wavefronts can occupy a SIMD if they use 83 or more registers (recall that wavefronts are 64 wide).

In my calculations: 2 * 83 * 64 = 10,624 registers, which is well under the 16,384 available per SIMD. Even a third wavefront would only bring the total to 3 * 83 * 64 = 15,936, so you could have 3 wavefronts with no problem.

The article I'm reading is here, in case there is something I've missed (7th paragraph).


Solution

  • Concerning the global limit:

    Each AMD GPU has a global limit on how many simultaneous wavefronts it can sustain. This limit is model-specific, but generally doesn't change between differently cut-down versions of the same chip. For example, for Cypress chips (5830, 5850, 5870) it is 496 wavefronts per GPU. Since those chips have different numbers of CUs, the maximum number of wavefronts/CU (as calculated from this constraint) ranges from 35.4 for the 5830 down to 24.8 for the 5870. For entry-level chips this global limit can work out to values as high as 96 wavefronts/CU. In these cases the limit of 32 wavefronts/CU (8 workgroups of 4 wavefronts each) applies instead, with 8 registers/thread.

    Now for the 2 wavefronts:

    Judging from the numbers given in the ATI Stream Programming Guide OpenCL, it seems that the number of usable registers is slightly lower than 16384, so I would guess (pure speculation; I haven't found any information about this) that some registers are used for other purposes not directly accessible by the kernel (instruction pointers and whatnot). In the table given there, no allocation uses more than 15872 registers, so that might be the usable maximum. That would fit the example: 3 wavefronts at 83 registers need 3 * 83 * 64 = 15936 registers, just over 15872, while at 82 registers they need 15744, which fits. Again, this is speculation, so it might simply be a case of someone using the wrong numbers in the manual and everyone copying them.

    In general the ATI Stream Programming Manual OpenCL is a good resource to learn about this. Be advised, though, that the link is the result of a quick Google search and doesn't seem to point to the most current version (it points to rev 1.03 while I am using rev 1.05, and I have no idea whether that is the most current either). I don't know if that makes any important difference, but a more in-depth search might be in order.
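
The two limits discussed above can be checked with a few lines of arithmetic. This is only a rough sketch, not anything taken from AMD's documentation: the function names are made up for illustration, the 15872-register budget is the speculative figure from the table mentioned above, and the CU counts (14 for the 5830, 20 for the 5870) are assumptions inferred from the 35.4 and 24.8 ratios. Rounding the global limit down per CU is likewise a simplification.

```python
WAVEFRONT_WIDTH = 64           # threads per wavefront
NOMINAL_REGISTERS = 16384      # registers per SIMD as usually quoted
USABLE_REGISTERS = 15872       # speculative: largest allocation seen in the guide's table
MAX_WAVEFRONTS_PER_CU = 32     # 8 workgroups of 4 wavefronts each


def wavefronts_by_registers(regs_per_thread, budget=USABLE_REGISTERS):
    """Whole wavefronts that fit if every thread needs regs_per_thread registers."""
    return budget // (regs_per_thread * WAVEFRONT_WIDTH)


def wavefronts_by_global_limit(gpu_wavefront_limit, compute_units):
    """Average wavefronts per CU permitted by the chip-wide wavefront limit."""
    return gpu_wavefront_limit / compute_units


def max_wavefronts(regs_per_thread, gpu_wavefront_limit, compute_units,
                   budget=USABLE_REGISTERS):
    """Smallest of the register limit, the global limit, and the 32-wavefront cap.

    The per-CU global limit is rounded down here, which is an assumption
    made for simplicity rather than documented hardware behaviour.
    """
    by_regs = wavefronts_by_registers(regs_per_thread, budget)
    by_global = int(wavefronts_by_global_limit(gpu_wavefront_limit, compute_units))
    return min(by_regs, by_global, MAX_WAVEFRONTS_PER_CU)


if __name__ == "__main__":
    # 83 registers/thread: 3 wavefronts would need 15936 registers.
    print(wavefronts_by_registers(83))                     # 2 with the 15872 budget
    print(wavefronts_by_registers(83, NOMINAL_REGISTERS))  # 3 with the nominal 16384
    print(wavefronts_by_registers(82))                     # 3 (15744 registers)
    # Global limit of 496 wavefronts on an assumed 20-CU and 14-CU Cypress part.
    print(round(wavefronts_by_global_limit(496, 20), 1))   # 24.8
    print(round(wavefronts_by_global_limit(496, 14), 1))   # 35.4
```

With the 15872 budget the cut-off lands exactly where the article puts it: 82 registers still allow 3 wavefronts, while 83 allow only 2. With the nominal 16384 the cut-off would not be at 83, which is what prompted the question in the first place.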