c# .net file-comparison

.NET C#: PDF file comparison not working with any method


I need to compare two PDF files for equality. The two files need to be identical in content, and I'm not having any success with the proposals found on:

https://stackoverflow.com/a/36108862/2807741

public static bool AreFileContentsEqual(String path1, String path2) =>
              File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

and

https://stackoverflow.com/a/76917554/2807741

private bool AreFilesEqual(string file1Path, string file2Path)
{
    string file1Hash = "", file2Hash = "";
    SHA1 sha = new SHA1CryptoServiceProvider();

    using (FileStream fs = System.IO.File.OpenRead(file1Path))
    {
        byte[] hash;
        hash = sha.ComputeHash(fs);
        file1Hash = Convert.ToBase64String(hash);
    }

    using (FileStream fs = System.IO.File.OpenRead(file2Path))
    {
        byte[] hash;
        hash = sha.ComputeHash(fs);
        file2Hash = Convert.ToBase64String(hash);
    }

    return (file1Hash == file2Hash);
}

(among other links I've tried).

I'm comparing two "identical" files and the comparison always returns false (the only case where it returns true is when I compare a file with itself).

This is how I created the files to compare:

  1. Word > Write any content > "Save as" > PDF.
  2. Keep the content intact and "Save as" > PDF (with different name)

Maybe something changes in the second file when it is saved, even though I'm not making any modifications to it?
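
One way to check that hypothesis is to find where the two files first diverge at the byte level. PDF files typically carry metadata such as `/CreationDate`, `/ModDate` and a trailer `/ID` that can differ between two exports of the same content, which would explain why byte and hash comparisons fail. The following is a minimal sketch only; the class name `PdfByteDiff` and the method `FirstDifference` are hypothetical helpers, not part of any library.

```csharp
using System;
using System.IO;

public static class PdfByteDiff
{
    // Hypothetical helper: returns the offset of the first differing byte,
    // or -1 if the two files are byte-identical. If one file is a prefix of
    // the other, the length of the shorter file is returned.
    public static long FirstDifference(string path1, string path2)
    {
        using (var a = new BufferedStream(File.OpenRead(path1)))
        using (var b = new BufferedStream(File.OpenRead(path2)))
        {
            long offset = 0;
            while (true)
            {
                int x = a.ReadByte();
                int y = b.ReadByte();
                if (x != y) return offset; // also covers only one stream having ended
                if (x == -1) return -1;    // both streams ended together
                offset++;
            }
        }
    }
}
```

Opening both files in a hex or text editor at the reported offset should show whether the difference sits in metadata rather than in the page content.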

Edit 1:

When I say "Identical" I mean identical in content. The PDFs will contain amounts (numbers), and those amounts in the PDF bills must be exactly the same.


Solution

  • OK, I'll answer my own question. iText7 is the way to go, as it can read the content of PDF files as text.

    Nuget package: https://www.nuget.org/packages/itext7

    using System.IO;
    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public IActionResult Index()
    {
        var exeFilePath = System.Reflection.Assembly.GetExecutingAssembly().Location;
        var workPath = Path.Combine(Path.GetDirectoryName(exeFilePath), "Assets");

        var file1 = Path.Combine(workPath, "testpdfv1.pdf");
        var file2a = Path.Combine(workPath, "testpdfv2equalv1.pdf");
        // Swap file2b in for file2a below to test the "different content" case.
        var file2b = Path.Combine(workPath, "testpdfv2differentv1.pdf");

        var fileContents1 = PdfToText(file1);
        var fileContents2 = PdfToText(file2a);

        // True when both PDFs yield the same extracted text.
        var filesAreEqual = fileContents1 == fileContents2;

        return View();
    }

    private string PdfToText(string pdfPath)
    {
        var result = new StringBuilder();

        // Dispose the document (and its reader) even if extraction throws.
        using (var pdfDocument = new PdfDocument(new PdfReader(pdfPath)))
        {
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
            {
                // Use a fresh strategy per page: a reused LocationTextExtractionStrategy
                // accumulates text, so earlier pages would be repeated in later results.
                var strategy = new LocationTextExtractionStrategy();
                result.Append(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
        }

        return result.ToString();
    }
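
Since the requirement is that the amounts in the bills match exactly, the extracted text could also be reduced to its numeric tokens before comparing, so that whitespace or line-break differences in the extracted text do not cause false negatives. This is a sketch under assumptions: `AmountComparer`, `ExtractAmounts` and `AmountsAreEqual` are hypothetical names, and the regular expression assumes amounts are written as digit groups separated by `.` or `,` (adjust it for other formats).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AmountComparer
{
    // Assumed amount format: digit runs optionally joined by '.' or ',',
    // for example "42" or "1,234.56".
    private static readonly Regex AmountPattern = new Regex(@"\d+(?:[.,]\d+)*");

    // Returns the numeric tokens in the order they appear in the text.
    public static List<string> ExtractAmounts(string text) =>
        AmountPattern.Matches(text).Cast<Match>().Select(m => m.Value).ToList();

    // True when both texts contain the same amounts in the same order.
    public static bool AmountsAreEqual(string text1, string text2) =>
        ExtractAmounts(text1).SequenceEqual(ExtractAmounts(text2));
}
```

It would be called on the extracted strings, for example `AmountComparer.AmountsAreEqual(fileContents1, fileContents2)`. Note that this ignores all non-numeric text, so it complements the full-text comparison rather than replacing it.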