c# asp.net-core md5sum

File comparison using md5 hash or length of a file?


I have a list of files on disk that I need to read and load into memory. I created the FileConfig class shown below, which holds the metadata for each file.

public class FileConfig
{
    public string FileName { get; set; }
    public DateTime Date { get; set; }
    public string FileContent { get; set; }
    public string MD5Hash { get; set; }
}

I store an MD5Hash string for each file so that I can later compare it against other files and tell whether a particular file has changed.

Below is my code, which gets the list of all files from disk and builds a sequence of FileConfig objects from it.

private IEnumerable<FileConfig> LoadFiles(string path)
{
    IList<string> files = procUtility.GetListOfFiles(path);
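    // Stop enumerating when there is nothing to load; "yield return default" would
    // emit a null item and then fall through to the loop below.
    if (files == null || files.Count == 0) { yield break; }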

    for (int i = 0; i < files.Count; i++)
    {
        var cfgPath = files[i];
        if (!File.Exists(cfgPath)) { continue; }
        var date = File.GetLastWriteTimeUtc(cfgPath);
        var content = File.ReadAllText(cfgPath);
        var pathPieces = cfgPath.Split(System.IO.Path.DirectorySeparatorChar, StringSplitOptions.RemoveEmptyEntries);
        var fileName = pathPieces[pathPieces.Length - 1];
        var md5Hash = procUtility.GetMD5Hash(cfgPath);
        yield return new FileConfig
        {
            FileName = fileName,
            Date = date,
            FileContent = content,
            MD5Hash = md5Hash
        };
    }
}

My end goal is to compare files (and also to use the file content for other purposes), so I use the MD5Hash string of each file in the FileConfig class to decide whether two files differ, like this:

!newFile.MD5Hash.Equals(oldFile.First().MD5Hash)

Is there a better way, for example inheriting from the FileInfo class in my FileConfig class and using each file's Length property for the comparison? Or is what I have fine?


Solution

  • What you have is fine. An MD5 hash is computed from the file contents, so even a single-byte difference will, for all practical purposes, produce a different hash. The chance of a false positive (two different files accidentally sharing the same hash) is negligible: MD5 produces a 128-bit digest, so the odds for any given pair of files are about 1 in 2^128. MD5 is not resistant to deliberately crafted collisions, but that only matters if someone may tamper with the files on purpose.

    However, a direct byte-by-byte comparison may be faster in your case, because generating a checksum has to read and process every byte of the file. If you do need a byte-by-byte comparison, use System.IO.File with LINQ's SequenceEqual (File.ReadAllBytes takes a path string, so pass FileInfo.FullName if you hold a FileInfo): File.ReadAllBytes(pathA).SequenceEqual(File.ReadAllBytes(pathB))

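    The byte comparison stops at the first difference, which is why we expect it to be faster than comparing MD5 hashes: a hash has to be computed over the entire file and cannot stop early. Note that File.ReadAllBytes still reads both files fully into memory before the comparison starts, so only the comparison itself ends early.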

    For text files, you can also compare line by line:

    File.ReadLines(fileA).SequenceEqual(File.ReadLines(fileB))

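    As for comparing by length, you should never rely on that alone. Two files of the same length can have completely different contents, so the false positives from relying solely on length are numerous. And since you already compute a hash for every file, adding a length check on top of it brings little to no benefit.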
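The two approaches discussed above can be sketched as follows. This is a minimal illustration, not the asker's actual utility: the FileComparer class and its method names are hypothetical, ComputeMd5Hex is only a guess at what something like procUtility.GetMD5Hash might do, and the 80 KB buffer size is an arbitrary choice. ContentEquals streams both files in chunks, so unlike File.ReadAllBytes it can stop reading at the first difference. It uses the file length only as a cheap early exit, never as proof of equality.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper class, for illustration only.
public static class FileComparer
{
    // Returns the lowercase hex MD5 digest of a file, streaming the file
    // through the hash instead of loading it all into memory.
    public static string ComputeMd5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Compares two files chunk by chunk and returns at the first difference.
    public static bool ContentEquals(string pathA, string pathB)
    {
        // Different lengths mean different content; equal lengths prove nothing,
        // so the byte comparison below still runs in that case.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length) { return false; }

        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            var bufA = new byte[81920]; // arbitrary buffer size (assumption)
            var bufB = new byte[81920];
            while (true)
            {
                int readA = FillBuffer(a, bufA);
                int readB = FillBuffer(b, bufB);
                if (readA != readB) { return false; }
                if (readA == 0) { return true; } // both streams exhausted
                for (int i = 0; i < readA; i++)
                {
                    if (bufA[i] != bufB[i]) { return false; }
                }
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) { break; }
            total += n;
        }
        return total;
    }
}
```

With a helper like this, a one-off check between two paths would read `bool changed = !FileComparer.ContentEquals(oldPath, newPath);`. Storing a hash, as in the question, remains the better fit when each file is compared many times or when the old content is no longer available on disk.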