c# file pdf header corrupt

Detect if PDF file is correct (header PDF)


I have a Windows .NET application that manages many PDF files. Some of the files are corrupt.

I have two issues. I'll try to explain in my imperfect English, sorry.

1.)

How can I detect whether a PDF file is correct?

I want to read the header of the PDF and detect whether it is correct.

var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");

2.)

How can I tell whether a byte[] (byte array) holding a file's contents is a PDF file or not?

For example, for ZIP files, you could examine the first four bytes and see if they match the local header signature, i.e. in hex

50 4b 03 04

if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 && buffer[3] == 0x04)

If you are loading it into a long, this is (0x04034b50). by David Pierson
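
To make the byte order explicit, here is a minimal sketch of both forms of the ZIP check. The class and method names (`ZipSignature`, `StartsWithZipHeader`, `StartsWithZipHeaderAsInteger`) are made up for illustration. The only fact relied on is that the bytes 50 4b 03 04, read as a little-endian 32-bit integer, give 0x04034b50.

```csharp
using System;

// Hypothetical helper; the names are illustrative and not from any library.
public static class ZipSignature
{
    // Byte-by-byte comparison against 50 4b 03 04.
    public static bool StartsWithZipHeader(byte[] buffer)
    {
        return buffer != null && buffer.Length >= 4
            && buffer[0] == 0x50 && buffer[1] == 0x4b
            && buffer[2] == 0x03 && buffer[3] == 0x04;
    }

    // Same check after assembling the first four bytes as a
    // little-endian 32-bit integer (lowest byte first).
    public static bool StartsWithZipHeaderAsInteger(byte[] buffer)
    {
        if (buffer == null || buffer.Length < 4) return false;
        uint value = (uint)buffer[0]
                   | ((uint)buffer[1] << 8)
                   | ((uint)buffer[2] << 16)
                   | ((uint)buffer[3] << 24);
        return value == 0x04034b50u;
    }
}
```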

I want the same for PDF files.

byte[] dataPDF = ...

var okPDF = PDFCorrect(dataPDF);

Any sample source code in .NET?


Solution

  • a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Usually, the problem files have a correct header, so the real causes of corruption lie elsewhere. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table giving the exact byte offset of each object from the start of the file. So, most probably, corrupted files have broken offsets or are missing some object.

    The best way to detect a corrupted file is to use a specialized PDF library. There are lots of both free and commercial PDF libraries for .NET. You can simply try to load the PDF file with one of them. iTextSharp would be a good choice.

    b. According to the PDF reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, currently from 0 to 7), and 99% of PDF files have such a header. However, there are other kinds of headers that Acrobat Viewer accepts, and even the absence of a header isn't a real problem for PDF viewers. So you shouldn't treat a file as corrupted just because it does not contain a header. E.g., the header may appear anywhere within the first 1024 bytes of the file, or be of the form %!PS-Adobe-N.n PDF-M.m

    Just for your information, I am a developer of the Docotic PDF library.
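
For the signature-style check asked about in the question, a minimal sketch follows. It is only a cheap sanity test, not corruption detection: as noted above, broken files usually have a perfectly good header. The class and method names (`PdfSniffer`, `HasPdfHeader`, `HasPdfTrailer`, `LooksLikePdf`) are hypothetical. Two assumptions are made: only the common `%PDF-` form of the header is searched for (the `%!PS-Adobe-N.n PDF-M.m` variant is ignored), and a well-formed file is expected to end with the `startxref` keyword followed by the `%%EOF` marker, which is how a reader locates the cross-reference table.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; PdfSniffer and its methods are not part of any library.
public static class PdfSniffer
{
    // Assumption: both header and trailer are looked for in a 1024-byte window.
    private const int Window = 1024;

    // True if "%PDF-" occurs somewhere within the first 1024 bytes.
    public static bool HasPdfHeader(byte[] data)
    {
        if (data == null) return false;
        int end = Math.Min(data.Length, Window);
        return IndexOf(data, Encoding.ASCII.GetBytes("%PDF-"), 0, end) >= 0;
    }

    // True if "startxref" and, after it, "%%EOF" occur within the last 1024 bytes.
    // This does not verify that the offsets in the cross-reference table are valid.
    public static bool HasPdfTrailer(byte[] data)
    {
        if (data == null) return false;
        int start = Math.Max(0, data.Length - Window);
        int xref = IndexOf(data, Encoding.ASCII.GetBytes("startxref"), start, data.Length);
        return xref >= 0
            && IndexOf(data, Encoding.ASCII.GetBytes("%%EOF"), xref, data.Length) >= 0;
    }

    public static bool LooksLikePdf(byte[] data)
    {
        return HasPdfHeader(data) && HasPdfTrailer(data);
    }

    // Reads the whole file into memory for simplicity.
    public static bool LooksLikePdf(string path)
    {
        return LooksLikePdf(File.ReadAllBytes(path));
    }

    // Naive search for pattern inside data[from, to); returns the index or -1.
    private static int IndexOf(byte[] data, byte[] pattern, int from, int to)
    {
        for (int i = from; i + pattern.Length <= to; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

Usage would be along the lines of `bool okPDF = PdfSniffer.LooksLikePdf(@"C:\temp\pdfile1.pdf");` or `PdfSniffer.LooksLikePdf(dataPDF)`. A file that passes can still be corrupt, so loading it with a real PDF library remains the more reliable test.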