Search code examples
c#excelhashcomparemd5

Comparing MD5 hashes between XLSX files with identical content


We have an internal web application that accepts a file of varying formats from a user in order to import large amounts of data into our systems.

One of the more recent upgrades that we implemented was to add a way to detect if a file was previously uploaded, and if so, to present the user with a warning and an option to resubmit the file, or to cancel the upload.

To accomplish this, we're computing the MD5 of the uploaded file, and comparing that against a database table containing the previously uploaded file information to determine if it is a duplicate. If there was a match on the MD5, the warning is displayed, otherwise it inserts the new file information in the table and carries on with the file processing.

The following is the C# code used to generate the MD5 hash:

private static string GetHash(byte[] input)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] data = md5.ComputeHash(input);

        StringBuilder bob = new StringBuilder();

        for (int i = 0; i < data.Length; i++)
            bob.Append(data[i].ToString("x2").ToUpper());

        return bob.ToString();
    }
}

Everything is working fine and well... with one exception.

Users are allowed to upload .xlsx files for this process, and unfortunately this file type also stores the metadata of the file within the file contents. (This can easily be seen by changing the extension of the .xlsx file to a .zip and extracting the contents [see below].)

Excel Metadata

Because of this, the MD5 hash of the .xlsx files will change with each subsequent save, even if the contents of the file are identical (simply opening and saving the file with no modifications will refresh the metadata and lead to a different MD5 hash).

In this situation, a file with identical records, but created at different times or by different users will slip past the duplicate file detection, and be processed.

My question: is there a way to determine if the content of an .xlsx file matches that of a previous file without storing the file content? In other words: is there a way to generate an MD5 hash of just the contents of an .xlsx file?


Solution

  • You can remove the pieces from the document that should not influence the hash before you compute the hash.

    This can be achieved by extracting all parts of the Open XML package into a single XML document, removing the undesired nodes and computing a hash of the resulting XML document. Note that you will have to recompute the hash for already uploaded Excel files for this to work because the hash now no longer is based on the binary file contents.

    Here is a simple sample program (add a reference to WindowsBase.dll):

    using System;
    using System.IO;
    using System.IO.Packaging;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;
    using System.Xml.Linq;
    
    internal class Program
    {
        private static readonly XNamespace dcterms = "http://purl.org/dc/terms/";
    
        private static void Main(string[] args)
        {
            var fileName = args[0];
    
            // open the ZIP package
            var package = Package.Open(fileName);
    
            // convert the package to a single XML document
            var xdoc = OpcToFlatOpc(package);
    
            // remove the nodes we are not interested in
            // here you can add other nodes as well
            xdoc.Descendants(dcterms + "modified").Remove();
    
            // get a stream of the XML and compute the hash
            using (var ms = new MemoryStream())
            {
                xdoc.Save(ms);
                ms.Position = 0;
    
                string md5 = GetHash(ms);
                Console.WriteLine(md5);
            }
        }
    
        private static string GetHash(Stream stream)
        {
            using (var md5 = MD5.Create())
            {
                var data = md5.ComputeHash(stream);
    
                var bob = new StringBuilder();
    
                for (int i = 0; i < data.Length; i++)
                {
                    bob.Append(data[i].ToString("X2"));
                }
    
                return bob.ToString();
            }
        }
    
        private static XDocument OpcToFlatOpc(Package package)
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
            var declaration = new XDeclaration("1.0", "UTF-8", "yes");
            var doc = new XDocument(
                declaration,
                new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
                new XElement(
                    pkg + "package",
                    new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                    package.GetParts().Select(GetContentsAsXml)));
    
            return doc;
        }
    
        private static XElement GetContentsAsXml(PackagePart part)
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
            if (part.ContentType.EndsWith("xml"))
            {
                using (var partstream = part.GetStream())
                {
                    using (var streamReader = new StreamReader(partstream))
                    {
                        string streamString = streamReader.ReadToEnd();
                        if (!string.IsNullOrEmpty(streamString))
                        {
                          var newXElement =
                            new XElement(
                              pkg + "part",
                              new XAttribute(pkg + "name", part.Uri),
                              new XAttribute(pkg + "contentType", part.ContentType),
                              new XElement(pkg 
                                + "xmlData", XElement.Parse(streamString)));
                            return newXElement;
                        }
    
                        return null;
                    }
                }
            }
    
            using (var str = part.GetStream())
            {
                using (var binaryReader = new BinaryReader(str))
                {
                    int len = (int)binaryReader.BaseStream.Length;
                    var byteArray = binaryReader.ReadBytes(len);
    
                    // the following expression creates the base64String, then chunks
                    // it to lines of 76 characters long
                    string base64String = Convert.ToBase64String(byteArray)
                           .Select((c, i) => new { Character = c, Chunk = i / 76 })
                           .GroupBy(c => c.Chunk)
                           .Aggregate(
                               new StringBuilder(),
                               (s, i) =>
                                   s.Append(
                                       i.Aggregate(
                                           new StringBuilder(),
                                           (seed, it) => seed.Append(it.Character),
                                           sb => sb.ToString()))
                                    .Append(Environment.NewLine),
                               s => s.ToString());
    
                    return new XElement(
                        pkg + "part",
                        new XAttribute(pkg + "name", part.Uri),
                        new XAttribute(pkg + "contentType", part.ContentType),
                        new XAttribute(pkg + "compression", "store"),
                        new XElement(pkg + "binaryData", base64String));
                }
            }
        }
    }