How to compare two PDFs and return true/false if they are the same (have the same content) without using any PDF library?


How can I write a C# method that takes two PDF file names and returns true if their content is identical and false otherwise? I am not concerned with what the differences are.

What I have tried:

  • Comparing the two files as binary:

using System.IO;

public static bool PdfFilesHaveTheSameContent(string filePathOne, string filePathTwo)
{
    // Load both files fully into memory.
    var file1 = File.ReadAllBytes(filePathOne);
    var file2 = File.ReadAllBytes(filePathTwo);

    // Files of different length cannot be byte-identical.
    if (file1.Length != file2.Length)
        return false;

    // Compare byte by byte and stop at the first mismatch.
    for (var i = 0; i < file1.Length; i++)
    {
        if (file1[i] != file2[i])
            return false;
    }

    return true;
}

I have also tried a similar method that reads the files as text (with File.ReadAllText()).

I assume the above methods don't work because some metadata is stored with the PDF, such as a GUID or a modification date. Is it possible to remove this? I have tried opening the PDF in a text editor, but it looks like gibberish and may need decoding somehow.
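
One way to test the metadata assumption is to find where the two files first differ and look at the surrounding bytes. The following is only a minimal sketch using core .NET; the class name PdfDiffProbe and both helper methods are hypothetical names introduced for illustration. If the differing region shows entries such as /CreationDate, /ModDate or the trailer's /ID, the metadata theory is plausible. If it falls inside a compressed stream, the context will not be readable and this probe tells you little.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical diagnostic helper, not part of any library.
public static class PdfDiffProbe
{
    // Returns the offset of the first differing byte,
    // or -1 if the two arrays are identical.
    public static long FirstDifference(byte[] a, byte[] b)
    {
        var shared = Math.Min(a.Length, b.Length);
        for (var i = 0; i < shared; i++)
        {
            if (a[i] != b[i])
                return i;
        }

        // Either the arrays are equal, or one is a prefix of the other.
        return a.Length == b.Length ? -1 : shared;
    }

    // Returns the bytes around an offset decoded as ISO-8859-1,
    // which maps every byte to exactly one character.
    public static string Context(byte[] data, long offset, int radius = 40)
    {
        var start = (int)Math.Max(0, offset - radius);
        var end = (int)Math.Min(data.Length, offset + radius);
        return Encoding.GetEncoding("iso-8859-1").GetString(data, start, end - start);
    }
}
```

Calling PdfDiffProbe.FirstDifference on the two byte arrays from File.ReadAllBytes, and then PdfDiffProbe.Context on each array at that offset, shows what each file contains at the point where they diverge. This only locates the first difference; it does not decide whether the two PDFs would look the same.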

Everything I've seen online recommends just using one of the many PDF libraries; unfortunately, this isn't an option for me. It needs to be done in core .NET only.

Is this possible?


Solution

  • In a comment you clarified

    I basically need to confirm that if both the pdfs were images, would they look the same?

    To check that, you unfortunately have to render the PDFs as images, and rendering PDFs as images takes a considerable amount of code. So, in combination with your remark

    Everything I've seen online recommends just using one of the many pdf libraries, unfortunately this isn't an option for me. It needs to be done in core .Net only.

    this means that you will have to add the code equivalent of one such PDF library to your project, at least until core .NET starts offering PDF rendering APIs.


    As an aside: even "looking the same as images" is more complicated than you may think. A PDF may look different when rendered on different devices; e.g. two PDFs may look identical on an RGB screen but quite different as a CMYK print-out (possibly with additional spot colors etc.).