Tags: c++, encoding, pcm, vorbis

Deinterleaving PCM (*.wav) stereo audio data


I understand that stereo PCM data is stored interleaved, as [left][right][left][right].... I am trying to convert a stereo PCM file to mono Vorbis (*.ogg), which I understand can be done by averaging the left and right channels ((left+right)*0.5). I have achieved this by amending the encoder example in the libvorbis SDK like this:

#define READ 1024
signed char readbuffer[READ*4];

and the PCM data is read thus

fread(readbuffer, 1, READ*4, stdin)

I then averaged the two channels:

buffer[0][i] = ((((readbuffer[i*4+1]<<8) | (0x00ff&(int)readbuffer[i*4]))/32768.f) + (((readbuffer[i*4+3]<<8) | (0x00ff&(int)readbuffer[i*4+2]))/32768.f)) * 0.5f;

It worked perfectly, but I don't understand how the code deinterleaves the left and right channels from the PCM data (i.e. all the bit shifting, "ANDing" and "ORing").


Solution

  • A .wav file typically stores its PCM data in little endian format, with 16 bits per sample per channel. For the usual signed 16-bit PCM file, this means that the data is physically stored as

    [LEFT LSB] [LEFT MSB] [RIGHT LSB] [RIGHT MSB] ...
    

    so that every group of 4 bytes makes up one stereo frame (one left sample followed by one right sample). Hence, you can find frame i by looking at bytes 4*i through 4*i+3, inclusive.

    To decode a single 16-bit value from two bytes, you do this:

    (MSB << 8) | LSB
    

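    Because your read buffer values are stored as signed chars, you have to be a bit careful: both MSB and LSB are sign-extended when they are promoted to int. That is what you want for the MSB, since it carries the sign of the sample, but not for the LSB: a low byte of 0x80 or above would become a negative int whose upper bits are all ones, and ORing it in would overwrite the MSB. Therefore, the code uses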

    0xff & (int)LSB
    

    to obtain the unsigned version of the low byte (technically, this works by upcasting to an int, and selecting the low 8 bits; an alternate formulation would be to just write (uint8_t)LSB).

    Note that the MSBs are at indices 1 and 3, and the LSBs are at indices 0 and 2. So,

    ((readbuffer[i*4+1]<<8) | (0x00ff&(int)readbuffer[i*4]))
    

    and

    ((readbuffer[i*4+3]<<8) | (0x00ff&(int)readbuffer[i*4+2]))
    

    are just obtaining the values of the left and right channels as 16-bit signed values by using some bit manipulation to assemble the bytes into numbers.

    Then, each of these values is divided by 32768.0f. A signed 16-bit value has a range of [-32768, 32767], so dividing by 32768 gives a value in [-1, 1). The two scaled values are added to give a number in [-2, 2), and the sum is multiplied by 0.5 to obtain the average, a floating-point value in [-1, 1).
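
To make the steps above concrete, here is a minimal sketch that splits the one-line expression from the question into two small functions. The names `decode_s16le` and `downmix_to_mono` are hypothetical and are not part of libvorbis. The sketch assumes 16-bit little-endian stereo input, as described above. It is not the libvorbis code itself: it casts each byte to unsigned char before combining the bytes and restores the sign at the end, which should give the same values while avoiding a left shift of a negative number.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: assemble one signed 16-bit little-endian sample
// from the two bytes at p[0] (LSB) and p[1] (MSB).
inline int decode_s16le(const signed char* p) {
    // Casting to unsigned char first prevents sign extension of either byte.
    const int lsb = static_cast<unsigned char>(p[0]);  // 0..255
    const int msb = static_cast<unsigned char>(p[1]);  // 0..255
    const int value = (msb << 8) | lsb;                // 0..65535
    // Map the upper half of the unsigned range back to negative values.
    return value >= 32768 ? value - 65536 : value;     // -32768..32767
}

// Hypothetical helper: average the left and right channels of `frames`
// interleaved stereo frames (4 bytes each) into mono floats in [-1, 1).
std::vector<float> downmix_to_mono(const signed char* bytes, std::size_t frames) {
    std::vector<float> mono(frames);
    for (std::size_t i = 0; i < frames; ++i) {
        const float left  = decode_s16le(bytes + 4 * i)     / 32768.0f;  // bytes 4i, 4i+1
        const float right = decode_s16le(bytes + 4 * i + 2) / 32768.0f;  // bytes 4i+2, 4i+3
        mono[i] = (left + right) * 0.5f;
    }
    return mono;
}
```

In the modified encoder example, the loop body would then reduce to something like copying `downmix_to_mono(readbuffer, frames)` into `buffer[0]`, where `frames` is the number of bytes returned by `fread` divided by 4.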