cuda nvidia video-processing yuv color-conversion

NVIDIA CUDA YUV (NV12) to RGB conversion algorithm breakdown

I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.

Here is the kernel code:

template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {

    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight) {
        return;
    }

    uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
    uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;

    YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
    YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
    YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);

    // YuvToRgbForPixel returns RGBA encoded in a uint32_t (.d)

    *(RgbIntx2*)pDst = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
    };
    *(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
    };
}

Here are my basic assumptions, some of which may be wrong:

1. NV12 has two planes: the first holds luma (Y) and the second holds interleaved chroma.
2. The kernel writes 4 pixels at a time.

If assumption 2 is correct, why are the same chroma (ch) values used for all 4 pixels? And if I am wrong about assumption 2, please explain what exactly happens here.

Solution

The chroma plane in NV12 (and NV21) is subsampled by a factor of 2 in both the horizontal and vertical directions.
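
To make the subsampling concrete, here is a minimal host-side C++ sketch of the plane sizes for an 8-bit NV12 frame. The `Nv12Layout` struct and `nv12Layout` function are hypothetical names used only for illustration. It assumes an even height and that the chroma plane directly follows the luma plane with the same row pitch, which is the layout the kernel in the question appears to rely on.

```cpp
#include <cstddef>

// Hypothetical helper for illustration: plane sizes of an 8-bit NV12 frame
// whose rows are `pitch` bytes apart. Assumes an even height and that the
// chroma plane starts right after the luma plane.
struct Nv12Layout {
    std::size_t lumaBytes;     // Y plane: `height` rows, one byte per pixel
    std::size_t chromaBytes;   // chroma plane: height/2 rows of interleaved pairs
    std::size_t chromaOffset;  // byte offset at which the chroma plane begins
};

Nv12Layout nv12Layout(int height, int pitch) {
    Nv12Layout layout;
    layout.lumaBytes = static_cast<std::size_t>(pitch) * height;
    // Vertical subsampling: only half as many chroma rows as luma rows.
    // Horizontal subsampling: width/2 pairs of 2 bytes each, so a chroma
    // row is as wide in bytes as a luma row and can share the same pitch.
    layout.chromaBytes = static_cast<std::size_t>(pitch) * (height / 2);
    layout.chromaOffset = layout.lumaBytes;
    return layout;
}
```

For a 1920x1080 frame with a pitch of 1920 bytes, this gives a chroma plane half the size of the luma plane, so the whole frame takes 1.5 bytes per pixel.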

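For every 2x2 macro pixel in the output there are 4 luma (Y) samples but only 1 Cb and 1 Cr sample. That is why the kernel reads a single ch pair and reuses ch.x and ch.y for all four pixels: each thread handles one 2x2 block, so your assumption 2 is correct.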
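
The following host-side C++ sketch shows where each sample of a pixel lives in an 8-bit NV12 buffer. `Nv12Offsets` and `nv12Offsets` are hypothetical names, and the layout assumptions are the same as in the kernel (chroma plane directly after `height` luma rows, same pitch). It is meant as a reference for checking the kernel's pointer arithmetic, not as the SDK's implementation.

```cpp
#include <cstddef>

// Hypothetical helper for illustration: byte offsets of the samples that
// belong to pixel (x, y) in an 8-bit NV12 buffer.
struct Nv12Offsets {
    std::size_t y;   // luma sample of this pixel
    std::size_t cb;  // first byte of the chroma pair shared by the 2x2 block
    std::size_t cr;  // second byte of the chroma pair shared by the 2x2 block
};

Nv12Offsets nv12Offsets(int x, int y, int height, int pitch) {
    Nv12Offsets o;
    o.y = static_cast<std::size_t>(y) * pitch + x;
    // The chroma plane starts after `height` luma rows, and one chroma row
    // covers two luma rows.
    std::size_t chromaRow = static_cast<std::size_t>(height) + y / 2;
    // Each chroma pair is 2 bytes wide and covers two pixels horizontally.
    std::size_t pairStart = chromaRow * pitch + (x / 2) * 2;
    o.cb = pairStart;
    o.cr = pairStart + 1;
    return o;
}
```

All four pixels (x, y), (x+1, y), (x, y+1) and (x+1, y+1) of a block with even x and y map to the same `cb` and `cr` offsets. This also matches the kernel's chroma load: `pSrc` already contains `y * nYuvPitch`, so adding `(nHeight - y / 2) * nYuvPitch` lands on row `nHeight + y / 2` for even `y`, which is the `chromaRow` computed above.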