MP3 exact frame size calculation

I know there are already a few questions like this here on SO, however they do not fully explain the formulas presented in the answers.

I'm writing a parser that should be able to process MPEG-1, 2 and 2.5 Audio Layer I, II and III frame headers. The goal is to calculate the exact size of the frame, including the header, the CRC (if present) and any data or metadata of this frame (basically the number of bytes between the start of one header and the beginning of the next one).

One of the code snippets/formulas commonly seen on the internet to achieve this is (in no specific programming language):

padding = doesThisFramehavePadding ? 1 : 0;
coefficient = sampleCount / 8;

// makes sense to me. the slot size seems to be the smallest addressable space in an mp3 frame 
// and is thus important for padding.
slotSize = mpegLayer == Layer1 ? 4 : 1;

// all fine here. bitRate / sampleRate yields bits per sample, multiplied by that weird 
// coefficient from earlier probably gives us <total bytes> per <all samples in this frame>.
// then add padding times slotSize.
frameSizeInBytes = ((coefficient * bitRate / sampleRate) + padding) * slotSize;

I have multiple questions regarding above code snippet:

1. What exactly would this "coefficient" even represent? As it's just sampleCount / 8, it's probably just something used to convert the units from bits to bytes in the final calculation, right?
2. If my assumption from 1. is correct: if (coefficient * bitRate / sampleRate) already yields something in bytes, what would multiplying it by the slot size achieve for Audio Layer I specifically? Wouldn't this imply that the unit of (coefficient * bitRate / sampleRate) should have been "slots" earlier, not "bytes"? If so, then what does the coefficient do, i.e. why divide by 8, even for Audio Layer I frames? Is this even correct?
3. Questions 1. and 2. lead me to believe that the code snippet above may not even be correct. If so, what would the correct calculation for MPEG versions 1, 2, 2.5 and layers I, II and III look like?
4. Does the above calculation still yield the correct result if the CRC protection bit is set in the frame header (i.e. 16 additional CRC bits are appended to the header)?
5. Speaking of the header: are the 4 header bytes included in the resulting frameSizeInBytes, or does the result indicate the length of the frame data/body?

Basically all these sub-questions can be summarized to:

What is the formula to calculate the exact total length of the current frame in bytes, including the header and things like the CRC, Xing and LAME metadata frames, and other eventualities?

Solution

I wrote this in Delphi/Pascal; the function returns either 0 for a bad frame or the frame's exact size in bytes. It is based on several websites: the first two illustrate and explain the MPEG audio frame header in full detail, while the third has crucial additions such as the formula(s):

http://checkmate.gissen.nl/headers.php (copy of the second, but with more colors)
http://www.mp3-tech.org/programmer/frame_header.html
http://mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm (might be the original and contains the formula calculating the actual byte length of a frame for MPEG 1):

For Layer I files use this formula: FrameLengthInBytes = (12 * BitRate / SampleRate + Padding) * 4
For Layer II & III files use this formula: FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
https://en.wikipedia.org/wiki/MP3#File_structure with a good picture illustrating the header
https://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header for even more details about VBR frames and how to calculate the overall playback duration
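
The two quoted formulas can be tried out with a few lines of code. The following is only a minimal sketch in Python (not the Delphi used further down); the function name and parameters are made up for illustration, the division is taken as an integer division, and the coefficient 72 for MPEG 2/2.5 Layer III is an assumption based on those frames carrying 576 instead of 1152 samples:

```python
def frame_length(version, layer, bitrate, sample_rate, padding):
    """Return the length of one frame in bytes.

    version: 1, 2 or 2.5; layer: 1, 2 or 3; bitrate in bits per second;
    sample_rate in Hz; padding: the header's padding bit (0 or 1).
    """
    if layer == 1:
        # 384 samples per frame, counted in 4-byte slots: 384 / 32 = 12.
        return (12 * bitrate // sample_rate + padding) * 4
    # 1152 samples per frame in 1-byte slots: 1152 / 8 = 144.
    # Assumption: MPEG 2 and 2.5 Layer III frames hold 576 samples, hence 72.
    coefficient = 72 if (layer == 3 and version != 1) else 144
    return coefficient * bitrate // sample_rate + padding


print(frame_length(1, 3, 128000, 44100, 0))  # 417
print(frame_length(1, 3, 128000, 44100, 1))  # 418
print(frame_length(1, 1, 128000, 44100, 0))  # 136
```

Note that for Layer I the fractional part is dropped before the multiplication by 4, so the result is always a whole number of 4-byte slots.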

const
  MPEG_BITRATE: Array[0.. 1, 1.. 3, 0.. 14] of Word=  // MPEG 2/1, Layer III/II/I
  ( ( ( 0,  8, 16, 24,  32,  40,  48,  56,  64,  80,  96, 112, 128, 144, 160 )  // 2 Layer III
    , ( 0,  8, 16, 24,  32,  40,  48,  56,  64,  80,  96, 112, 128, 144, 160 )  // 2 Layer II
    , ( 0, 32, 48, 56,  64,  80,  96, 112, 128, 144, 160, 176, 192, 224, 256 )  // 2 Layer I
    )
  , ( ( 0, 32, 40, 48,  56,  64,  80,  96, 112, 128, 160, 192, 224, 256, 320 )  // 1 Layer III
    , ( 0, 32, 48, 56,  64,  80,  96, 112, 128, 160, 192, 224, 256, 320, 384 )  // 1 Layer II
    , ( 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448 )  // 1 Layer I
    )
  );

  MPEG_SAMPLERATE: Array[0.. 3, 0.. 2] of Word=  // MPEG 2.5/?/2/1
  ( ( 11025, 12000,  8000 )
  , (     0,     0,     0 )
  , ( 22050, 24000, 16000 )
  , ( 44100, 48000, 32000 )
  );


// Read 4 header bytes from a stream and give back the frame length in bytes as
// a positive 16-bit value. The length spans from the first header byte up to
// the next frame's header, so it includes the 4 header bytes (and the 2 CRC
// bytes, if present). If fewer than 4 bytes can be read or a non-standard
// condition is met, the function exits with 0, indicating a bad frame.
function IsValidMpegHeader( oIn: TStream ): Word;
var
  aHead: Array[1.. 4] of Byte;  // 4 bytes.
  iBitRateKilo, iSampleRate: Word;  // 16-bit; looked up from the array constants above.
  iPadding, iSlotSize, iSamples: Byte;  // 8-bit.
begin
  result:= 0;  // Every plain "exit" below must report a bad frame.
  if oIn.Read( aHead[1], 4 )<> 4 then exit;  // Read next 4 bytes into array.



  // 11 bits sync:
  if (aHead[1]<> $FF) then exit;  // First 8 bits.
  if (aHead[2] and $E0)<> $E0 then exit;  // Next 3 bits.

  // 2 bits MPEG version:
  if (aHead[2] and $18)= $08 then exit;  // $00=2.5; $08=reserved; $10=2; $18=1

  // 2 bits Audio Layer:
  if (aHead[2] and $06)= $00 then exit;  // $00=reserved; $02=III; $04=II; $06=I

  // 1 bit "Protection" flag. End of 16 bits.



  // 4 bits Bitrate:
  if (aHead[3] and $F0)= $F0 then exit;  // 0=free, thus allowed; all 4 bits set=bad

  // 2 bits Frequency:
  if (aHead[3] and $0C)= $0C then exit;  // All bits=reserved.

  // 1 bit "Padding" flag.

  // 1 bit "Private" flag. End of 24 bits.



  // 2 bits "Channel Mode": 0=stereo; 1=joint stereo; 2=dual channel; 3=mono

  // 2 bits Mode Extension.

  // 1 bit "Copyright" flag.

  // 1 bit "Original" flag.

  // 2 bits Emphasis. End of 32 bits.
  if (aHead[4] and $03)= $02 then exit;  // $00=none; $01=50/15 µs; $02=reserved; $03=CCITT J.17



  // 1 upper bit from 2nd byte, shifted 3 bits to the right  = MPEG version
  // 2 bits      from 2nd byte, shifted 1 bit  to the right  = Audio Layer
  // 4 bits      from 3rd byte, shifted 4 bits to the right  = Bitrate
  iBitRateKilo:= MPEG_BITRATE[(aHead[2] shr 3) and 1][(aHead[2] shr 1) and 3][(aHead[3] shr 4) and $F];

  // Layer II disallows specific combinations.
  if (aHead[2] and $06)= $04 then
  case iBitRateKilo of
    32, 48, 56, 80:     if (aHead[4] and $C0)<> $C0 then exit;  // Only single channel allowed.
    224, 256, 320, 384: if (aHead[4] and $C0)= $C0 then exit;  // No single channel allowed.
  end;

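  // Samples per frame divided by the bits per slot (32 for Layer I, 8 otherwise),
  // e.g. 384/ 32= 12 and 1152/ 8= 144. This is the "coefficient" of the formula.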
  if (aHead[2] and $18)= $18 then begin  // MPEG v1
    case aHead[2] and $06 of
      $06: iSamples:= 12;  // Layer I
    else 
      iSamples:= 144;  // Layer II and III
    end;
  end else begin  // MPEG v2 and v2.5
    case aHead[2] and $06 of
      $06: iSamples:= 12;  // Layer I
      $04: iSamples:= 144;  // Layer II
    else 
      iSamples:= 72;  // Layer III
    end;
  end;

  // Set slot size and padding (in bytes).
  if (aHead[2] and $06)= $06 then iSlotSize:= 4 else iSlotSize:= 1;  // Layer I = 32 bits.
  if (aHead[3] and $02)= $02 then iPadding := 1 else iPadding := 0;  // Padding bit.

  // 2 bits from second byte, shifted 3 bits to the right  = MPEG version
  // 2 bits from third byte,  shifted 2 bits to the right  = Frequency
  iSampleRate:= MPEG_SAMPLERATE[(aHead[2] shr 3) and 3][(aHead[3] shr 2) and 3];
  if iSampleRate= 0 then exit;


  // The division itself is a real/float one, not an Integer division. The quotient
  // must not be rounded; its fractional part is simply cut off. If it is 1152.9
  // then it still means 1152 slots, not 1153. The cut has to happen before the
  // padding is added and before multiplying by the slot size, because a Layer I
  // frame always consists of whole 4-byte slots. This calculation works for all
  // MPEG versions, not just v1.
  result:= (Trunc( iSamples* iBitRateKilo* 1000/ iSampleRate )+ iPadding)* iSlotSize;


  (* Originally I thought the checksum would make the frame bigger, but after examining
     a couple of files it turned out that the 2 CRC bytes are already counted within the
     frame length. This is also confirmed by https://hydrogenaud.io/index.php/topic,119033.0.html
     indicating that this was never meant for (stored) files, but instead only for (network)
     transmissions and would indeed waste 16 valuable bits.
  if (aHead[2] and $01)= $00 then Inc( result, 2 );  // 16-bit CRC after header. *)
end;

If the function returns 0, you are most likely inside a metadata tag. The calculated frame length covers the whole frame: it includes the 4 header bytes and, if present, the 2 CRC bytes. It is exactly the number of bytes to seek forward from the start of the current header to land in front of the next frame's header.
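
To illustrate hopping from header to header, here is a minimal Python sketch. It is deliberately limited to MPEG 1 Layer III to stay short, the function names are hypothetical, and it assumes the computed length is counted from the first header byte of each frame:

```python
BITRATES_V1_L3 = (0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320)
SAMPLE_RATES_V1 = (44100, 48000, 32000)


def frame_length_v1_l3(header):
    """Return the frame length in bytes for an MPEG 1 Layer III header, else 0."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xFE) != 0xFA:
        return 0  # no sync, or not MPEG 1 Layer III
    bitrate_index = header[2] >> 4
    rate_index = (header[2] >> 2) & 3
    if bitrate_index in (0, 15) or rate_index == 3:
        return 0  # free-format and reserved values are not handled in this sketch
    padding = (header[2] >> 1) & 1
    bitrate = BITRATES_V1_L3[bitrate_index] * 1000
    return 144 * bitrate // SAMPLE_RATES_V1[rate_index] + padding


def count_frames(data, offset=0):
    """Hop from header to header and count the consecutive valid frames."""
    count = 0
    while True:
        length = frame_length_v1_l3(data[offset:offset + 4])
        if length == 0 or offset + length > len(data):
            return count
        count += 1
        offset += length  # measured from the start of the current header


# Two synthetic 128 kbit/s, 44.1 kHz frames: one unpadded, one padded.
frame_a = bytes([0xFF, 0xFB, 0x90, 0x00]) + bytes(417 - 4)
frame_b = bytes([0xFF, 0xFB, 0x92, 0x00]) + bytes(418 - 4)
print(count_frames(frame_a + frame_b))  # 2
```

The second header is only found because the first hop is exactly 417 bytes long, which is what makes this a handy self-check when testing a parser against real files.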

1. Yes.
2. Around the formulas this is explained a bit better:

   Padding is used to fit the bit rates exactly. For an example: 128k 44.1kHz layer II uses a lot of 418 bytes and some of 417 bytes long frames to get the exact 128k bitrate. For Layer I slot is 32 bits long, for Layer II and Layer III slot is 8 bits long.

   First, let's distinguish two terms frame size and frame length. Frame size is the number of samples contained in a frame. It is constant and always 384 samples for Layer I and 1152 samples for Layer II and Layer III. Frame length is length of a frame when compressed. It is calculated in slots. One slot is 4 bytes long for Layer I, and one byte long for Layer II and Layer III. When you are reading MPEG file you must calculate this to be able to find each consecutive frame. Remember, frame length may change from frame to frame due to padding or bitrate switching.
3. It is correct (although I wouldn't fully understand it either as written there). Since I wrote my code I've tested it with all variants of MP3s and I've always found the next frame at exactly the expected position.
4. Yes. The CRC is "added to the header" only logically; it does not lengthen the frame. The 2 bytes directly after the 4 header bytes hold the 16-bit CRC, and they are already counted in the frame length.
5. Yes. The frame header is always 4 bytes long and is included in the frame length, as are the 2 CRC bytes if present.
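
The padding example from the quote (128k at 44.1 kHz) can be checked with exact arithmetic. This is just a small Python sketch with a made-up helper name; it assumes the encoder spreads the padding so that the long-term average matches the nominal bitrate exactly:

```python
from fractions import Fraction


def padded_share(coefficient, bitrate, sample_rate):
    """Return (unpadded frame length, share of frames that need the padding slot)."""
    exact = Fraction(coefficient * bitrate, sample_rate)  # exact slots per frame
    base = exact.numerator // exact.denominator           # length without padding
    return base, exact - base


base, share = padded_share(144, 128000, 44100)
print(base, base + 1, share)  # 417 418 47/49
```

So 47 out of every 49 frames are 418 bytes long and the remaining 2 are 417 bytes long, which matches "a lot of 418 bytes and some of 417 bytes". At 48 kHz the quotient is exactly 384, so no padding is needed at all.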

I wrote this to exactly count frames in MP3 files encoded with variable bitrates, where frame sizes can have very different lengths. And I was fed up with lazy overall calculations that would only do guesswork.

The "special" VBR frames that don't contain audio but instead additional info can be fairly well detected, too. For this we need to know the "side info" of a frame:

const
  // https://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header
  MPEG_SIDEINFO: Array[0.. 1, FALSE.. TRUE] of Byte=   // MPEG 2/1, Mono/Non-mono
  ( (  9, 17 )
  , ( 17, 32 )  // Only MPEG 1 non-mono has the offset after 32 bytes
  );


// Returns TRUE if one of the identifications matches. iHead2 and iHead4 are the
// 2nd and 4th byte of the frame header that was read just before; the stream must
// be positioned right after those 4 header bytes.
function IsVbrFrame( oIn: TStream; iHead2, iHead4: Byte ): Boolean;
var
  iSideInfo: Byte;
  aIdent: Array[1.. 4] of AnsiChar;  // Like bytes, but treating it as ASCII (1 byte each).
begin
  // 1 upper bit from 2nd byte, shifted 3 bits to the right  = MPEG version
  // 2 highest bits from 4th byte (Channel Mode) equal mode "Mono"?
  iSideInfo:= MPEG_SIDEINFO[(iHead2 shr 3) and 1][(iHead4 and $C0)<> $C0];

  // After we read the 4 bytes from the header, go forward either 9, 17 or 32 
  // bytes and read 4 bytes of identification for almost any VBR frame.
  oIn.Seek( iSideInfo, soCurrent );
  oIn.Read( aIdent[1], 4 );

  if (aIdent= 'Xing')
  or (aIdent= 'Info')
  or (aIdent= 'LAME')
  or (aIdent= 'UUUU')
  or (aIdent= 'GOGO')
  or (aIdent= 'MPGE') then begin
    result:= TRUE;
  end else begin
    // Go back the 4 bytes we just read and the sideinfo portion we skipped
    // to then always jump 32 bytes forwards, regardless of MPEG version and
    // Channel Mode. Then read 4 bytes again and check for the only known ID.
    oIn.Seek( 0- 4- iSideInfo, soCurrent );
    oIn.Seek( 32, soCurrent );
    oIn.Read( aIdent[1], 4 );

    result:= (aIdent= 'VBRI');
  end;
end;
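
The same lookup can be sketched on an in-memory frame instead of a stream. This Python version is only an illustration under a few assumptions: the function name is made up, the frame is passed as a byte string starting at its header, and only the 'Xing', 'Info' and 'VBRI' identifiers are checked (the other ones from the Delphi code are left out):

```python
SIDE_INFO_SIZE = {
    (True, True): 17, (True, False): 32,   # MPEG 1: mono, non-mono
    (False, True): 9, (False, False): 17,  # MPEG 2/2.5: mono, non-mono
}


def find_vbr_tag(frame):
    """Return b'Xing', b'Info' or b'VBRI' if the frame carries that tag, else None."""
    if len(frame) < 4:
        return None
    is_mpeg1 = ((frame[1] >> 3) & 1) == 1
    is_mono = (frame[3] & 0xC0) == 0xC0
    offset = 4 + SIDE_INFO_SIZE[(is_mpeg1, is_mono)]  # header + side info
    tag = frame[offset:offset + 4]
    if tag in (b"Xing", b"Info"):
        return tag
    # A VBRI tag sits at a fixed position: 32 bytes after the 4 header bytes.
    if frame[36:40] == b"VBRI":
        return b"VBRI"
    return None


# Synthetic MPEG 1 Layer III stereo frame with a Xing tag after 32 bytes of side info.
frame = bytes([0xFF, 0xFB, 0x90, 0x00]) + bytes(32) + b"Xing" + bytes(377)
print(find_vbr_tag(frame))  # b'Xing'
```

Working on a byte string avoids the back-and-forth seeking, at the cost of having to read the whole frame first.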

You may also want to read

...which is also useful to know where the first audio frame is to be found (after tags at the start of the file) and when you've reached the last one (before tags at the end of the file).