Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

I am looking for a way to check if a PDF is missing an end of file character. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check if the last character was an EOF. I need to process lots of potentially large PDF's and I want to load as little memory as possible.

Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a little more specific scenario then detecting general PDF "corruption". What is the best, fast way to do this?

Solution

TL;DR

Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.

Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.

Check Last Kilobyte for File Trailer

This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:

Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

Therefore, the following will work by looking for the file trailer instruction within that range:

def valid_file_trailer? filename
  File.open filename { |f| f.seek -1024, :END; f.read.include? '%%EOF' }
end

A Stricter Check of the File Trailer via Regex

However, the ISO standard is both more complex and a lot more strict. It says, in part:

The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).

Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:

def valid_file_trailer? filename
  pattern = /^startxref\n\d+\n%%EOF\n\z/m
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end