
Is mask adaptive in __shfl_up_sync call?


Basically, it is a concrete version of this post. Suppose a warp needs to process 4 objects (say, pixels in an image), with each group of 8 lanes processing one object:

[figure: a warp split into four 8-lane groups, one group per object]

Now I need to do shuffle operations while processing one object (i.e. among the 8 lanes of that object's group), and it worked for each object just by setting the mask to 0xff:

uint32_t mask = 0xff;
__shfl_up_sync(mask, val, 1);

However, to my understanding, setting the mask to 0xff should force only lanes 0 through 7, i.e. the lanes of object0 (or is it object3? I am also stuck on this point), to participate. Yet after a mass of trials I am sure that the usage above works for every object. So, my question is: does the __shfl_up_sync call adapt the mask argument so that the corresponding lanes of each group participate?
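
For reference, here is a minimal sketch of what I observed (hypothetical kernel name, not my real code):

#include <cstdint>

// Minimal repro sketch (hypothetical name): every 8-lane group shifts its
// values up by one, with all lanes passing the same 8-bit mask 0xff.
__global__ void group_shuffle_repro(int *out)
{
    const unsigned int lane = threadIdx.x % 32; // lane id within the warp
    int val = static_cast<int>(lane);
    // Observation: this appears to work for all four 8-lane groups,
    // even though mask 0xff only names lanes 0 through 7.
    const uint32_t mask = 0xff;
    out[threadIdx.x] = __shfl_up_sync(mask, val, 1);
}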

Update
Actually, this question came from the code of libSGM that I tried to parse. In particular, it solves the minimal-cost path with dynamic programming in a decently parallel way. The kernel aggregate_vertical_path_kernel is launched with this execution configuration:

// MAX_DISPARITY is 128 and BLOCK_SIZE is 256.
// Basically, each block processes 32 pixels, and each warp processes 4 of them.
const int gdim = (width + PATHS_PER_BLOCK - 1) / PATHS_PER_BLOCK;
const int bdim = BLOCK_SIZE;
aggregate_vertical_path_kernel<1, MAX_DISPARITY><<<gdim, bdim, 0, stream>>>(...)

An object dp is instantiated from DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE>:

static constexpr unsigned int DP_BLOCK_SIZE = 16u;
...
//MAX_DISPARITY is 128
static const unsigned int SUBGROUP_SIZE = MAX_DISPARITY / DP_BLOCK_SIZE;
...
DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE> dp;
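
For concreteness, the grouping arithmetic works out like this (a sketch with hypothetical constant names; the values are taken from the snippets above):

// Sketch of the grouping arithmetic (values from the snippets above):
constexpr unsigned int MAX_DISPARITY    = 128u;
constexpr unsigned int DP_BLOCK_SIZE    = 16u;
constexpr unsigned int BLOCK_SIZE       = 256u;
constexpr unsigned int SUBGROUP_SIZE    = MAX_DISPARITY / DP_BLOCK_SIZE; // 8 lanes per pixel
constexpr unsigned int PIXELS_PER_BLOCK = BLOCK_SIZE / SUBGROUP_SIZE;    // 32 pixels per block
constexpr unsigned int PIXELS_PER_WARP  = 32u / SUBGROUP_SIZE;           // 4 pixels per warp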

Following the program further, dp.update() is invoked, in which __shfl_up_sync is used to access the last element of the previous DP_BLOCK and __shfl_down_sync is used to access the first element of the next DP_BLOCK. Besides, each group of 8 lanes in one warp works together:

// Each group of 8 threads processes one pixel, and each lane contributes one DP_BLOCK of that pixel.
const unsigned int lane_id = threadIdx.x % SUBGROUP_SIZE;

Here is the crux: once the program reaches this line,

// mask is specified as 0xff (255)
const uint32_t prev = __shfl_up_sync(mask, dp[DP_BLOCK_SIZE - 1], 1);

each lane in the warp shuffles with the same mask 0xff, which is what raises my question above.


Solution

  • It's confusing when you do this:

    lane0:lane7 | lane0:lane7 | lane0:lane7 | lane0:lane7
    

    because a warp doesn't have 4 sets of lanes numbered lane 0 to lane 7. It has one set of lanes, numbered lane 0 to lane 31:

    lane 31 | lane 30 | ... | lane 0
    

    Note that I have ordered the lanes this way because it corresponds to the bit order in the mask. It should be evident which bit corresponds to which lane: bit 0 in the mask parameter corresponds to lane 0, and so on.
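
    To illustrate the mapping (a sketch, not part of the original code), the mask bits covering an arbitrary 8-lane group could be computed like this:

    // Sketch: which mask bits correspond to this lane's 8-lane group.
    unsigned int lane       = threadIdx.x % 32;      // lane 0..31
    unsigned int group      = lane / 8;              // group 0..3
    uint32_t     group_mask = 0xffu << (group * 8);  // e.g. group 2 -> 0x00ff0000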

    This confusion is compounded by the fact that you are only specifying 8 bits, i.e. 8 lanes, in your mask:

    uint32_t mask = 0xff;
    

    If you want the warp to be able to correctly use all 32 lanes to process all 4 objects, you must specify a 32-bit mask:

    uint32_t mask = 0xffffffff;
    

    There is no "adaptation" of an 8-bit mask to apply to each group of 8 lanes in the warp. You must explicitly specify the mask for each of the 32 lanes. This is true even if the width parameter is used (see below).

    If you want the shuffle operation to work only within each 8-lane group (i.e. as 4 logical shuffles per warp), that is what the width parameter is for:

    T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
                                                                   ^^^^^
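
    Putting it together, a minimal sketch (hypothetical kernel name, not from the question's code): pass the full 32-bit mask and let width=8 split the warp into four independent shuffle subgroups:

    #include <cstdint>
    #include <cstdio>

    // Sketch: width=8 makes the shuffle operate within each 8-lane
    // subgroup, while the mask still names all 32 participating lanes.
    __global__ void shuffle_width_demo()
    {
        const unsigned int lane = threadIdx.x % 32;
        int val = static_cast<int>(lane);
        // All 32 lanes participate (full mask). The shift stays inside
        // each 8-lane group because of width=8; the first lane of each
        // group receives its own value back.
        int prev = __shfl_up_sync(0xffffffffu, val, 1, 8);
        printf("lane %2u: val=%2d prev=%2d\n", lane, val, prev);
    }

    int main()
    {
        shuffle_width_demo<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }

    Each 8-lane group then behaves like a small warp for the shuffle, which matches libSGM's lane_id = threadIdx.x % SUBGROUP_SIZE grouping.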