crc data-integrity

In which order should I put LEN/CRC/DATA in a message? Should CRC protect the LEN field?


There's a section (2.5) in "Xz format inadequate for long-term archiving":

According to Koopman (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.

He is describing the case where a message is laid out like this:

ID LEN DATA CRC

If LEN is damaged, the bytes at some random position will be used as the CRC. But I fail to see why that is a problem: at that random position there will almost surely be an invalid CRC value, so the error is detected.

He also talks about decoding the data that follows. But even if LEN is protected, I fail to see how one could decode the following data. If LEN is damaged, we cannot find the next message in either case.

For example, PNG doesn't protect the length field: its chunk CRC covers the chunk type and chunk data, but not the length.

So why is it obviously better when the LEN field is protected by a CRC?

If I were to design a message structure, what is the best way to do it? In what order should the fields go, and what should the CRC protect? Suppose that the message has the following parts:

  • message type ID (variable length integer)
  • message length (variable length integer)
  • CRC
  • the message data itself

My current design is this:

  1. CRC, protects the whole message
  2. message type ID (variable length integer)
  3. message length (variable length integer)
  4. the message data itself

Are there any drawbacks to this design?


Solution

  • What Koopman actually says (here) is:

    Failing to protect message length field: results in pointing to data as FCS, giving HD=1

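    HD is the Hamming distance, here the smallest number of bit errors that can go undetected, and FCS is the frame check sequence, i.e. the check value. HD=1 means a single bit error in the length field can make the receiver treat part of the data as the (faux) check value instead of the actual one, so the probability of an undetected error can go up significantly, even on a low bit-error-rate stream. To really do it right, you should protect the length field and other header values with their own check value, placed before the data.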
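To make the HD=1 point concrete, here is a minimal Python sketch. It assumes a hypothetical frame of a 1-byte ID, a 1-byte LEN, the data, and a CRC-32 (from `zlib`) computed over the data only, so the length is unprotected. The payload is deliberately contrived so that a single flipped bit in LEN makes the decoder find a matching check value inside the data; with arbitrary data those bytes would match only by chance.

```python
import struct
import zlib

def encode(msg_id: int, data: bytes) -> bytes:
    """Frame as ID | LEN | DATA | CRC32(DATA). LEN is not covered by the CRC."""
    header = struct.pack(">BB", msg_id, len(data))
    return header + data + struct.pack(">I", zlib.crc32(data))

def decode(frame: bytes):
    """Return (msg_id, data) if the CRC found where LEN points matches, else None."""
    msg_id, length = struct.unpack_from(">BB", frame, 0)
    if len(frame) < 2 + length + 4:
        return None  # LEN points past the end of what we received
    data = frame[2:2 + length]
    (crc,) = struct.unpack_from(">I", frame, 2 + length)
    return (msg_id, data) if zlib.crc32(data) == crc else None

# Contrived 24-byte payload: 8 bytes, then the CRC-32 of those 8 bytes, then filler.
inner = b"8 bytes!"
payload = inner + struct.pack(">I", zlib.crc32(inner)) + b"x" * 12

frame = bytearray(encode(1, payload))
intact = decode(bytes(frame))        # the full 24-byte payload is accepted

frame[1] ^= 0x10                     # one bit error in LEN: 24 becomes 8
damaged = decode(bytes(frame))       # decoder now reads payload bytes 8..11 as the FCS

print(intact == (1, payload))        # True
print(damaged == (1, inner))         # True: a truncated message passes the check
```

One bit error was enough to get a wrong message accepted, and the decoder would then go on to parse the leftover payload bytes as the start of the next frame.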

    As for your design, putting the CRC first means the sender has to buffer the whole message to compute the CRC before it can write anything to the stream. You could instead use: type id, length, header crc, message, message crc.
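A rough Python sketch of the suggested layout (type id, length, header crc, message, message crc) might look like the following. The fixed-width fields (1-byte type ID, 4-byte length), the use of CRC-32 from `zlib`, and the names `write_frame` and `read_frame` are all assumptions made for illustration; the question's variable-length integers would need their own encoding.

```python
import io
import struct
import zlib

HEADER = struct.Struct(">BI")  # assumed fixed widths: 1-byte type ID, 4-byte length
CRC = struct.Struct(">I")

def write_frame(out, type_id: int, chunks, length: int) -> None:
    """Write header and header CRC, then stream the payload followed by its CRC."""
    header = HEADER.pack(type_id, length)
    out.write(header)
    out.write(CRC.pack(zlib.crc32(header)))
    crc, written = 0, 0
    for chunk in chunks:               # no need to buffer the whole message
        out.write(chunk)
        crc = zlib.crc32(chunk, crc)   # running CRC over the payload so far
        written += len(chunk)
    if written != length:
        raise ValueError("payload size does not match declared length")
    out.write(CRC.pack(crc))

def read_exact(inp, n: int) -> bytes:
    buf = inp.read(n)
    if len(buf) != n:
        raise ValueError("truncated frame")
    return buf

def read_frame(inp):
    """Check the header CRC before trusting the length, then check the payload CRC."""
    header = read_exact(inp, HEADER.size)
    (header_crc,) = CRC.unpack(read_exact(inp, CRC.size))
    if zlib.crc32(header) != header_crc:
        raise ValueError("header CRC mismatch: length field cannot be trusted")
    type_id, length = HEADER.unpack(header)
    payload = read_exact(inp, length)
    (payload_crc,) = CRC.unpack(read_exact(inp, CRC.size))
    if zlib.crc32(payload) != payload_crc:
        raise ValueError("payload CRC mismatch")
    return type_id, payload

stream = io.BytesIO()
write_frame(stream, 7, [b"hello, ", b"world"], 12)
stream.seek(0)
print(read_frame(stream))  # (7, b'hello, world')
```

Because the header check comes first, a damaged length is caught before the reader uses it to locate the payload CRC, and the writer only ever needs the current chunk in memory.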