Tags: c#, comparison, reference-type, arrays

What is the fastest way to find if an array of byte arrays contains another byte array?


I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that hash against a list of hashes of already-processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:

public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }
    // Formats a byte array as a "0x..."-prefixed hex string, emitting the
    // high and low nibble of each byte as one character each.
    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;

        c[0] = '0'; c[1] = 'x';

        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            // high nibble
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);

            // low nibble
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }

        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");

        List<string> readFileHashes = GetReadFileHashes();

        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // Files whose hash is not already known are new and still need to be read.
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        //read new files
    }
}

Is there any way I can speed this up?


Solution

  • I believe you can achieve the most significant performance improvement by simply checking the file size first: if the size does not match any known file, you can skip it entirely without even opening it.

    Instead of saving only a list of known hashes, keep the known file sizes as well, and do a content comparison only when the sizes match. When the sizes differ, you never have to look at the file content at all (a minimal sketch of this is at the end of this answer).

    Depending on how large your files typically are, a further improvement can be worthwhile:

    • Either do a binary compare with an early abort as soon as the first differing byte is found. This saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm has to read the whole file, whereas detecting that the first byte differs spares you the rest (see the comparison sketch at the end of this answer). If your lookup list is likely to contain many files of the same size, so that you would have to binary-compare against several candidates, consider instead:

    • Hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hashes in your lookup, and compare the second block only if the first block matches; for files that differ, this usually saves reading beyond the first block (see the block-hash sketch at the end of this answer). Both of these options are only really worth the effort when your files are large.

    I doubt that changing the hashing algorithm itself (e.g. first doing a CRC check, as suggested) would make any significant difference. Your bottleneck is likely disk I/O, not CPU, so avoiding disk I/O is what will give you the most improvement. But as always with performance, do measure.

    Then, if this is still not enough (and only then), experiment with asynchronous I/O (but remember that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).
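
    A minimal sketch of the size-first idea, assuming a hypothetical knownHashesBySize lookup (file length -> set of hex hash strings) that you would build once from the already-processed files; the extension methods are the ones from the question:

// Skip hashing entirely when no known file has the same length.
var knownHashesBySize = new Dictionary<long, HashSet<string>>();
// ... populate from your list of already-processed files ...

var filesToRead = new List<FileInfo>();

foreach (var file in new DirectoryInfo("c:\\temp").GetFiles("*.*"))
{
    HashSet<string> candidates;
    if (!knownHashesBySize.TryGetValue(file.Length, out candidates))
    {
        // No known file has this size, so it must be new: no hashing needed.
        filesToRead.Add(file);
        continue;
    }

    // At least one known file has the same size, so fall back to the
    // (expensive) full content hash from the question.
    if (!candidates.Contains(file.ToSHA256Hash().ToHexString()))
        filesToRead.Add(file);
}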
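
    A sketch of the streamed binary comparison with early abort; the method name, buffer size, and FillBuffer helper are arbitrary choices for illustration, not part of the original code:

// Compare two files block by block and stop at the first difference, so
// files that differ are usually detected after reading only a small prefix.
public static bool FilesHaveSameContent(string pathA, string pathB)
{
    const int BufferSize = 64 * 1024; // arbitrary; tune for your disks

    using (var a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
    using (var b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
    {
        if (a.Length != b.Length)
            return false; // the cheap size check again

        var bufA = new byte[BufferSize];
        var bufB = new byte[BufferSize];

        int readA, readB;
        do
        {
            readA = FillBuffer(a, bufA);
            readB = FillBuffer(b, bufB);
            if (readA != readB)
                return false;

            for (int i = 0; i < readA; i++)
                if (bufA[i] != bufB[i])
                    return false; // early abort: no need to read the rest
        } while (readA > 0);

        return true;
    }
}

// Read until the buffer is full or the stream ends; Stream.Read may return
// fewer bytes than requested, so loop until done.
private static int FillBuffer(Stream s, byte[] buffer)
{
    int total = 0, read;
    while (total < buffer.Length &&
           (read = s.Read(buffer, total, buffer.Length - total)) > 0)
    {
        total += read;
    }
    return total;
}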
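
    And a sketch of the block-hash variant for the first block only; the 1 MB block size and the knownFirstBlockHashes lookup are assumptions for illustration, and the full-file hash from the question would still decide the final match:

// Hash only the first 1 MB of a file so that most differing files can be
// rejected without reading (or hashing) the rest.
// (This would live in ProgramExtensions alongside ToSHA256Hash.)
public static byte[] ToFirstBlockHash(this FileInfo file)
{
    const int BlockSize = 1024 * 1024; // 1 MB, as suggested above

    using (var fs = new FileStream(file.FullName, FileMode.Open, FileAccess.Read))
    using (var hasher = new SHA256Managed())
    {
        var buffer = new byte[BlockSize];

        // Fill the buffer up to BlockSize bytes (or to end of file).
        int total = 0, read;
        while (total < buffer.Length &&
               (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }

        return hasher.ComputeHash(buffer, 0, total);
    }
}

// Usage: knownFirstBlockHashes is a hypothetical HashSet<string> built from
// the first-block hashes of already-processed files.
//
//     if (!knownFirstBlockHashes.Contains(file.ToFirstBlockHash().ToHexString()))
//         filesToRead.Add(file);   // definitely new
//     else
//         { /* only now compute and compare the full-file hash */ }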