c string data-structures file-format null-terminated

A C style string file format conundrum

I'm very confused with this wee little problem I have. I have a non-indexed file format header. (more specifically the ID3 header) Now, this header stores a string or rather three bytes for conformation that the data is actually an ID3 tag (TAG is the string btw.) Point is, now that this TAG in the file format is not null-terminated. So there are two things that can be done:

Load the entire file with fread and for non-terminated string comparison, use strncmp. But:
1. This sounds hacky
2. What if someone opens it up and tries to manipulate the string w/o prior knowledge of this?
The other option is that the file be loaded, but the C struct shouldn't exactly map to the file format, but include proper null-terminators, and then each member should be loaded using a unique call. But, this too feels hacky and is tedious.

Help, especially from people who have practical experience with dealing with such stuff, is appreciated.

Solution

If the file format specification says a certain three bytes have the values corresponding to 'T', 'A', 'G' (84, 65, 71), then you should compare just those three bytes.

For this example, strncmp() is OK. In general, memcmp() is better because it doesn't have to worry about string termination, so even if the byte stream (tag) you are comparing contains ASCII NUL '\0' characters, memcmp() will work.

You also need to recognize whether the file format you are working with is primarily printable data or whether it is primarily binary data. The techniques you use for printable data can be different from the techniques used for binary data; the techniques used for binary data sometimes (but not always) translate for use with printable data. One big difference is that the lengths of values in binary data is known in advance, either because the length is embedded in the file or because the structure of the file is known. With printable data, you are often dealing with variable-length encodings with implicit boundaries on the fields - and no length encoding information ahead of it.

For example, the Unix password file format is a text encoding with variable length fields; it uses a ':' to separate fields. You can't tell how long a field is until you come across the next ':' or the end of the line. This requires different handling from a binary format encoded using ASN.1¹, where fields can have a type indicator value (usually a byte) and a length (can be 1, 2 or 4 bytes, depending on type) before the actual data for the field.

¹ ASN.1 is (justifiably) regarded as very complex; I've given a very simple example of roughly how it is used that can be criticized on many levels. Nevertheless, the basic idea is valid - length (and with ASN.1, usually type too) precedes the (binary) data. This is also known as TLV - type, length, value - encoding.