I am processing files and use magic numbers to identify file type validity.
I am using the Medsea mime-util JAR for Java to inspect the magic number and determine the MIME type. The library recognizes two PDF byte sequences, which it checks from left to right:
%PDF-
\xef\xbb\xbf%PDF-
If the PDF does not start with either of those sequences, it is rejected.
I have been given the following file (see image), which opens without errors in Acrobat and other viewers, but I cannot identify the Byte Order Mark (BOM), if that is what it is, in the bytes preceding %PDF-.
255044462D is %PDF- in hex.
Here is the HEX sequence with the unidentified BOM:
ACED0005757200025B42ACF317F8060854E0020000787000007CD4255044462D
Is this a valid BOM, and if so, how do I identify it?
UPDATE
Per the answer below, the solution is to search the first 1024 bytes for the %PDF- sequence. I solved this in the Medsea mime-util library by altering the magic.mime file, using an undocumented feature that is described only in a comment in the source code.
Alter this entry:
0 string %PDF- application/pdf ignore pdf
as follows:
0 string>1024 %PDF- application/pdf ignore pdf
This undocumented feature is explained in a comment in the source of eu.medsea.mimeutil.detector.MagicMimeEntry.java, in the method readBuffer(byte[]), in the handling of MagicMimeEntry.STRING_TYPE:
// The following is not documented in the Magic(5) documentation.
// This is an extension to the magic rules and is provided by this utility.
// It allows for better matching of some text based files such as XML files
The code that follows this comment parses a ># suffix from the column 2 "type" value and uses # as the size of the buffer to search, starting at the index given in column 1.
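To make the ># convention concrete, here is a minimal sketch of how such a type column could be split into a type name and a search length. This is only an illustration and not the Medsea source; the class name MagicTypeColumn and its fields are hypothetical.

```java
// Illustrative only: splits a magic.mime "type" column such as "string>1024"
// into the type name and the number of bytes to search. Not Medsea code.
final class MagicTypeColumn {
    final String type;       // e.g. "string"
    final int searchLength;  // e.g. 1024, or -1 when no ">#" suffix is present

    private MagicTypeColumn(String type, int searchLength) {
        this.type = type;
        this.searchLength = searchLength;
    }

    static MagicTypeColumn parse(String column) {
        int gt = column.indexOf('>');
        if (gt < 0) {
            // Plain entry: the pattern must match exactly at the given offset.
            return new MagicTypeColumn(column, -1);
        }
        // Extended entry: everything after '>' is the search buffer size.
        int length = Integer.parseInt(column.substring(gt + 1));
        return new MagicTypeColumn(column.substring(0, gt), length);
    }
}
```

With this sketch, MagicTypeColumn.parse("string>1024") yields the type "string" and a search length of 1024, while a plain "string" entry has no search length.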
Read this answer on a related topic:
According to the PDF standard (ISO 32000-2, similarly also already in ISO 32000-1):
The PDF file begins with the 5 characters “%PDF-”
(ISO 32000-2, section 7.5.2 "File header")
In particular, there is no such thing as "UTF-8 encoded PDFs (preceded with the UTF-8 Byte Order Mark)"; even that BOM is already invalid.
Nonetheless, Adobe Reader and other PDF viewers open files with a few arbitrary leading trash bytes as PDFs without complaint. This happens because Adobe Reader is explicitly lax about the specification:
Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.
(Adobe PDF Reference sixth edition, appendix H.3 "Implementation Notes", item 13)
and other PDF viewers follow its lead.
Thus, if you want to use magic numbers to identify file type validity as in "valid according to the specification", you must only accept files beginning with the 5 characters “%PDF-”. On the other hand, if you want to judge validity by "opens in common viewers", you have to accept anything with “%PDF-” appearing somewhere within the first 1024 bytes of the file.
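The two notions of validity can be sketched in Java roughly as follows. This is a minimal sketch, not part of mime-util; the class and method names are hypothetical, and it assumes that the whole “%PDF-” marker must lie within the first 1024 bytes for the lenient check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict (specification) vs. lenient (viewer-style) header checks.
final class PdfHeaderSniffer {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final int LENIENT_WINDOW = 1024; // limit quoted from the Adobe implementation note

    // "Valid according to the specification": the file starts with %PDF-.
    static boolean isStrictPdf(byte[] data) {
        return indexOf(data, PDF_MAGIC, PDF_MAGIC.length) == 0;
    }

    // "Opens in common viewers": %PDF- appears within the first 1024 bytes.
    static boolean isLenientPdf(byte[] data) {
        return indexOf(data, PDF_MAGIC, LENIENT_WINDOW) >= 0;
    }

    // Returns the first index at which needle occurs entirely within the
    // first `window` bytes of data, or -1 if there is no such occurrence.
    static int indexOf(byte[] data, byte[] needle, int window) {
        int limit = Math.min(data.length, window) - needle.length;
        for (int i = 0; i <= limit; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

For the hex sequence in the question, isStrictPdf returns false and isLenientPdf returns true, because the marker starts at byte offset 27.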
Even worse,
Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
(Adobe PDF Reference sixth edition, appendix H.3 "Implementation Notes", item 14)
So in this case you also have to accept this sequence in the first 1024 bytes...
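A viewer-style check that accepts both header forms might look like the sketch below. It is only an illustration with hypothetical names, and it assumes that N, n, M and m are single decimal digits, which the quoted note does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper: accepts either header form within the first 1024 bytes.
final class ViewerStyleHeaderCheck {
    // Assumption: each of N, n, M, m is a single decimal digit.
    private static final Pattern HEADER =
            Pattern.compile("%PDF-|%!PS-Adobe-\\d\\.\\d PDF-\\d\\.\\d");

    static boolean looksLikePdf(byte[] data) {
        int len = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to exactly one char, so arbitrary
        // binary bytes before the header cannot break the decoding.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        return HEADER.matcher(head).find();
    }
}
```

Decoding as ISO-8859-1 is a convenience for using a regular expression on binary data; a byte-wise search would work just as well.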
I didn't close your question as a duplicate of the referenced answer because you appear to believe that there is such a thing as "UTF-8 encoded PDFs" and that some BOMs may be valid in front of the “%PDF-”. No: nothing is allowed in front of those header bytes, neither a UTF BOM nor anything else.