
Image formats NV12 storage in memory


I fully understand the size of the NV12 format, as described in the question

NV12 format and UV plane

Now I am reading two sources about how the UV plane is stored in this format. The first is MSDN: https://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx

NV12

All of the Y samples appear first in memory as an array of unsigned char values with an even number of lines. The Y plane is followed immediately by an array of unsigned char values that contains packed U (Cb) and V (Cr) samples. When the combined U-V array is addressed as an array of little-endian WORD values, the LSBs contain the U values, and the MSBs contain the V values. NV12 is the preferred 4:2:0 pixel format for DirectX VA. It is expected to be an intermediate-term requirement for DirectX VA accelerators supporting 4:2:0 video. The following illustration shows the Y plane and the array that contains packed U and V samples.

What I understand from this: in the UV plane, each U sample and each V sample is stored in a single byte.
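To make that reading of the MSDN text concrete, here is a minimal Python sketch. It assumes tightly packed rows (stride equal to the width, no padding), and the helper name `nv12_offsets` and the sample values are made up for illustration. It computes where the Y, U and V bytes of a pixel would sit and checks the little-endian WORD statement.

```python
import struct


def nv12_offsets(x, y, width, height):
    """Byte offsets of the Y, U and V samples for pixel (x, y).

    Assumption: rows are tightly packed (stride == width).
    """
    y_off = y * width + x
    uv_base = width * height                      # UV plane starts right after Y
    # One chroma row serves two luma rows; each U/V pair takes two bytes.
    u_off = uv_base + (y // 2) * width + (x // 2) * 2
    return y_off, u_off, u_off + 1


# Hypothetical 4x2 frame: 8 Y bytes, then U0 V0 U1 V1.
frame = bytes(range(8)) + bytes([0xA0, 0xB0, 0xA1, 0xB1])

y_off, u_off, v_off = nv12_offsets(3, 1, 4, 2)
print(y_off, u_off, v_off)                        # 7 10 11

# Read the U/V pair as one little-endian 16-bit WORD.
(word,) = struct.unpack_from("<H", frame, u_off)
print(hex(word & 0xFF), hex(word >> 8))           # 0xa1 0xb1 -> low byte U, high byte V
```

Under this reading, every chroma sample is a whole byte; only the order of the bytes differs from I420.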

But then I read about this on wikipedia: https://wiki.videolan.org/YUV#NV12

It says:

NV12

Related to I420, NV12 has one luma "luminance" plane Y and one plane with U and V values interleaved. In NV12, chroma planes (blue and red) are subsampled in both the horizontal and vertical dimensions by a factor of 2. For a 2x2 group of pixels, you have 4 Y samples and 1 U and 1 V sample. It can be helpful to think of NV12 as I420 with the U and V planes interleaved. Here is a graphical representation of NV12. Each letter represents one bit:

For 1 NV12 pixel: YYYYYYYY UVUV
For a 2-pixel NV12 frame: YYYYYYYYYYYYYYYY UVUVUVUV
For a 50-pixel NV12 frame: Y*8*50 (UV)*2*50
For a n-pixel NV12 frame: Y*8*n (UV)*2*n

What I understand here is that U and V are interleaved bit by bit within each byte, so each byte of the UV plane would contain 4 U bits and 4 V bits, interleaved.

Can anyone clarify which of the two interpretations is correct?


Solution

  • TL;DR: MSDN is correct

    To verify this (or at least verify that there is no interleaving at the bit level), one can use ffmpeg, which is a widely used video tool. I did the following experiment:

    1. Make a file containing some text (I took the example Lorem Ipsum text)
    2. Tell ffmpeg to read it as an I420 video frame of some small size
    3. Tell ffmpeg to convert it to NV12 format
    4. Print it
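The steps above can also be emulated without ffmpeg. The following Python sketch is not what ffmpeg runs internally; it assumes byte-level interleaving (the MSDN reading) and a 96x4 frame filled with the Lorem Ipsum text repeated with a single space as separator. The helper name `i420_to_nv12` is made up for illustration.

```python
W, H = 96, 4  # frame size used in the experiment

LOREM = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, "
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo "
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse "
    "cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non "
    "proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
)


def i420_to_nv12(frame: bytes, width: int, height: int) -> bytes:
    """Convert one tightly packed I420 frame to NV12 by interleaving whole bytes."""
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    y = frame[:y_size]
    u = frame[y_size:y_size + c_size]
    v = frame[y_size + c_size:y_size + 2 * c_size]
    uv = bytearray(2 * c_size)
    uv[0::2] = u  # even bytes of the UV plane hold U
    uv[1::2] = v  # odd bytes of the UV plane hold V
    return y + bytes(uv)


frame_size = W * H * 3 // 2                       # 12 bits per pixel -> 576 bytes
text = " ".join([LOREM] * 2).encode("ascii")      # assumption: copies joined by a space
i420 = text[:frame_size]
nv12 = i420_to_nv12(i420, W, H)
print(nv12.decode("ascii"))
```

If the byte-level reading is right, the text printed by this sketch should show the same scrambled tail as the real ffmpeg output.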

    Here is an example command line for (2) and (3):

    ffmpeg -s 96x4 -i example_i420.yuv -pix_fmt nv12 example_nv12.yuv
    

    Here is what I got in the output:

    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sutnett uirn acduilppias cqiunig oeflfiitc,i as edde sdeor uenitu smmooldl itte mapnoirm iindc iedsitd ulnatb ourtu ml.a bLoorree me ti pdsoulmo rdeo lmoarg nsai ta laimqeuta,. cUotn seenci

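    The chroma (U and V) samples are the scrambled tail of the output, starting at "utnett uirn" right after the 384 bytes (96x4) of luma. They are evidently the same values (ASCII letters) as in the input, just in a different order: taking every other character of "utnett uirn" gives "unt in" (U) and "tetur" (V). If any bit-interleaving were performed, I would get different values.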
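The same check can be done mechanically. This is a small Python sketch using the first 12 chroma bytes of the output above; the `bit_interleave` function is a hypothetical bit-level scheme, written only to show that such a scheme would not preserve the original letters.

```python
chroma = b"utnett uirn "  # first 12 chroma bytes of the output above

# Byte-level de-interleaving: even bytes are U, odd bytes are V.
u = chroma[0::2]
v = chroma[1::2]
print(u, v)  # b'unt in' b'tetur '


def bit_interleave(u_byte: int, v_byte: int) -> bytes:
    """Hypothetical bit-by-bit interleaving of one U byte and one V byte."""
    out = 0
    for i in range(8):
        out |= ((u_byte >> i) & 1) << (2 * i)      # U bits go to even positions
        out |= ((v_byte >> i) & 1) << (2 * i + 1)  # V bits go to odd positions
    return out.to_bytes(2, "big")


# What the chroma bytes would look like if bits, not bytes, were interleaved.
mixed = b"".join(bit_interleave(a, b) for a, b in zip(u, v))
print(mixed == chroma)  # False
```

The de-interleaved halves are readable fragments of the input text, which is what byte-level interleaving predicts.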

    So the description in the VLC wiki (BTW, it's not Wikipedia) is incorrect. Someone with the name "Edwardw" added the "illustration", which originally mentioned pixels, and later changed it to say "bits". I hope someone changes it to be less misleading (the wiki requires registration, so I cannot edit it).