ID3v2 Specification

According to the ID3v2.3.0 specification (http://id3.org/id3v2.3.0), the layout of the frame header is:

Frame ID       $xx xx xx xx (four characters)
Size           $xx xx xx xx
Flags          $xx xx

However, a few lines further down, the same page says that frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00; if Unicode is used it should be $01.

This is confusing, because the flags (2 bytes) come directly after the frame size, so I would expect the encoding byte to come after the flags.

So which of the following two layouts is correct?

Frame ID       $xx xx xx xx (four characters)
Size           $xx xx xx xx
Flags          $xx xx
Encoding       $xx
Text

Frame ID       $xx xx xx xx (four characters)
Size           $xx xx xx xx
Encoding       $xx
Flags          $xx xx
Text

Solution

I think this is actually a case of bad wording in the spec rather than a mistake. I found two diagrams in the ID3v2 Chapter Frame Addendum showing examples of complete headers. That document describes two newly introduced frame types, which are not relevant to the question at hand. Fortunately, it also contains examples of an embedded 'Title/Songname/Content description' frame (TIT2) and a 'Subtitle/Description refinement' frame (TIT3), both of which are text frames*:

[Image: header diagrams from the ID3v2 Chapter Frame Addendum]

According to the diagram, the Title frame (ID: TIT2) has the following structure. First comes the frame header:

Frame ID       $xx xx xx xx (four characters)
Size           $xx xx xx xx
Flags          $xx xx

which is then directly followed by ID-dependent fields:

Text encoding  $xx
Information    <text string according to encoding>

This layout makes the most sense to me. If you still have doubts about the correct layout, you could check out the source of one of the existing implementations.
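To make the layout concrete, here is a minimal Python sketch of reading one ID3v2.3 text frame with the encoding byte placed after the flags, as the first byte of the frame body. The function name `parse_text_frame` and the `ENCODINGS` table are my own illustration, not part of the spec or of any existing library. The sketch also assumes that none of the frame flags (compression, encryption, grouping) are set, since those add extra bytes before the frame data.

```python
import struct

# Text encodings defined by ID3v2.3: $00 = ISO-8859-1, $01 = Unicode (UTF-16 with BOM).
ENCODINGS = {0x00: "latin-1", 0x01: "utf-16"}


def parse_text_frame(data: bytes, offset: int = 0):
    """Parse one ID3v2.3 text frame (hypothetical helper, flags assumed to be zero).

    Returns (frame_id, flags, text, next_offset).
    """
    # 10-byte frame header: 4-byte ID, 4-byte big-endian size, 2-byte flags.
    frame_id, size, flags = struct.unpack_from(">4sIH", data, offset)
    body_start = offset + 10
    body = data[body_start:body_start + size]  # size does not include the header
    if size < 1 or len(body) != size:
        raise ValueError("truncated or empty frame body")
    # The encoding byte is the first byte of the body, i.e. it follows the flags.
    if body[0] not in ENCODINGS:
        raise ValueError(f"unknown text encoding byte: {body[0]:#04x}")
    text = body[1:].decode(ENCODINGS[body[0]])
    # Drop an optional trailing terminator.
    return frame_id.decode("ascii"), flags, text.rstrip("\x00"), body_start + size


# Frame body is 6 bytes: one encoding byte ($00) plus the 5 bytes of "Hello".
frame = b"TIT2" + struct.pack(">IH", 6, 0) + b"\x00Hello"
print(parse_text_frame(frame))  # ('TIT2', 0, 'Hello', 16)
```

Note that the size field counts the encoding byte, because that byte belongs to the frame body and not to the header.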

Sidenote: in the ID3v2.4.0 specification, the confusing sentence was changed to:

Frames that allow different types of text encoding contains a text encoding description byte.

* Only frames that allow different types of text encoding have a text encoding description byte. Unsurprisingly, most of these are text frames.