pdf, binaryfiles

PDF content is not enough to reconstruct the PDF?


I open a PDF file "test.pdf" in Vim and copy its content into another text buffer, which I save as "copy.pdf". I don't understand why "copy.pdf" is different: it can be opened as a PDF (the title shows), but the page is empty.

The same happens when I read the file in JavaScript with FileReader.readAsBinaryString and write it back to disk, so it is not related to how I copy in Vim.

Even stranger, Finder reports that the copy is actually 30 KB bigger.

Where are the hidden bytes?


Solution

  • Usually when I see this sort of behavior and the resulting blank pages, it is because a program or process is treating the binary data of a PDF as text in some form or another: for example, doing CR/LF conversion, tab-to-space conversion, or interpreting the data as UTF-8 instead of binary. Any such transformation will ruin the binary streams within a PDF and will make the byte offsets in the cross-reference table incorrect, so the PDF becomes unreadable. Perhaps your process of writing back to disk is doing CR/LF conversion or otherwise treating your binary blob as non-binary?
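To illustrate where the extra bytes can come from, here is a minimal sketch, not a diagnosis of your exact setup. It assumes the string from readAsBinaryString (one character per byte, values 0 to 255) is later written out as UTF-8 text. The byte array is a made-up stand-in for PDF content, and the helper names `toBinaryString`, `encodeAsUtf8` and `binaryStringToBytes` are hypothetical, chosen only for this example.

```javascript
// Hypothetical stand-in for PDF data: "%PDF", a CR/LF pair, then some
// high bytes (>= 0x80) of the kind found in compressed streams.
const original = new Uint8Array([
  0x25, 0x50, 0x44, 0x46, 0x0d, 0x0a, 0xe2, 0xe3, 0xcf, 0xd3, 0x00, 0xff,
]);

// Mimics what readAsBinaryString returns: one character per input byte.
function toBinaryString(bytes) {
  let s = "";
  for (const b of bytes) s += String.fromCharCode(b);
  return s;
}

// Lossy path: the string is encoded as UTF-8 text on the way out.
// Every character in the range 0x80..0xFF turns into two bytes.
function encodeAsUtf8(binaryString) {
  return new TextEncoder().encode(binaryString);
}

// Byte-preserving path: map each character back to exactly one byte.
function binaryStringToBytes(binaryString) {
  const out = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    out[i] = binaryString.charCodeAt(i) & 0xff;
  }
  return out;
}

const asString = toBinaryString(original);
const corrupted = encodeAsUtf8(asString);
const restored = binaryStringToBytes(asString);

console.log(original.length, corrupted.length, restored.length);
// prints: 12 17 12
```

The five high bytes each grow to two bytes, so the "text" copy is larger than the original, and everything after the first high byte sits at a different offset than the cross-reference table expects. If that is what is happening in your case, reading with readAsArrayBuffer and writing the raw bytes (or converting the string back byte by byte, as in `binaryStringToBytes`) should keep the file identical.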