Search code examples
c#algorithmperformanceoptimizationfilesystems

Find files duplicates. Optimization


The algorithm is to get all files from the directory, group them up by their size, so we have <long(size), string[](file names)>, then compare files in "file names" buckets. I'm not ok with current performance, it takes around 30-40 sec to calculate duplicates at the first try, and 3-4 sec after. (I guess my hard drive cashes these files)

I'm looking for the ways to optimize this code:

using System;
using System.IO;
using System.Linq;


static bool CompareFiles(string file1, string file2)
{
    if (file1 == file2) return true; 

    
    const int BYTES = 1024 * 10;
    
    using FileStream fs1 = File.Open(file1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite); 
    using FileStream fs2 = File.Open(file2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite); 
    
        byte[] buffer1= new byte[BYTES]; 
        byte[] buffer2= new byte[BYTES];
    

    while (true)
    {
        int len1 = fs1.Read(buffer1, 0, BYTES);
        int len2 = fs2.Read(buffer2, 0, BYTES);

        if (!((ReadOnlySpan<Byte>)buffer1).SequenceEqual((ReadOnlySpan<byte>)buffer2)) return false; // SequenceEqual is bad.
        if (len1 == 0 || len2 == 0) break;
    }

    return true;
    
}

static void CompareInDirectory(string startingDirectory)
{

    
    var dinfo = new DirectoryInfo(startingDirectory); 
    var files = dinfo.EnumerateFiles("",SearchOption.AllDirectories)
        .Select(finfo => new { finfo.FullName, finfo.Length })
        .GroupBy(file => file.Length)
        .ToDictionary(group => group.Key, group => group.Select(file => file.FullName));
    
    foreach (var entry in files)
    {
        var bucket = entry.Value.ToArray();
        for (int i = 0; i < bucket.Length; i++) {
            for (int k = i + 1; k < bucket.Length; k++) {
                if (CompareFiles(bucket[i],bucket[k])) {
                    Console.WriteLine(bucket[i] + " and " + bucket[k] + " are the same");
                }
            }
        }
    }
}

 CompareInDirectory(@"C:\Program Files (x86)\Steam");

I was suggested to use hashes, since "parallel" file accessing is not fast, so it is better to compute hashes and then group files by their value, but I don't really want to do it this way.


Solution

  • You could optimize your current algorithm by taking advantage of the fact that the equality is a transitive property. So if you have found that the files A and B are identical, you don't have to check the remaining files with both A and B. If C is equal to A, it's also equal to B. Or if C is different from to A, it's also different from the B. So you can just compare the C with the A, and omit the comparison between C and B.

    Similarly if you have found that the files A, B and C are identical, you can compare the next file D with A only, and omit the comparisons D-B and D-C.

    The details of implementing this optimization can be tricky, so I'll leave it as an exercise for the reader.

    This optimization can reduce greatly the total number of CompareFiles operations, in case that most files with equal size are also identical. If that's not the case, and it's common to have unique files with common lengths, then this optimization will not yield great benefits. In that case you could consider dividing the files in smaller buckets, by grouping them not only by their size but also by the hash of their content. The same optimization will most likely yield better results, because files with equal hashes are astronomically more likely to be identical¹.

    ¹ Unless the startingDirectory contains files coming from a research for HashDoS attacks or something.


    Update: Here is an implementation of this idea. The GetPairs method below is a generic method that takes a list and a comparison function, and emits all the possible pairs along with the information if each pair consists of equal elements. This implementation makes the fewer comparisons possible, by omitting comparisons between elements that are known to be equal or different with previously compared elements.

    static IEnumerable<(T First, T Second, bool AreEqual)> GetPairs<T>(
        IReadOnlyList<T> source,
        Func<T, T, bool> equalityComparison)
    {
        ArgumentNullException.ThrowIfNull(source);
        ArgumentNullException.ThrowIfNull(equalityComparison);
        int[] state = new int[source.Count];
        Array.Fill(state, -1);
        for (int i = 0; i < source.Count; i++)
        {
            for (int j = i + 1; j < source.Count; j++)
            {
                bool areEqual;
                if (state[i] == -1)
                {
                    // The source[i] is different from all elements in the range [0..i].
                    if (state[j] == -1)
                    {
                        // The source[j] is different from all elements
                        // in the range [0..i].
                        // No precedent information available, we must compare.
                        areEqual = equalityComparison(source[i], source[j]);
                        if (areEqual) state[j] = i;
                    }
                    else
                    {
                        // The source[j] is equal to an element in the range [0..i].
                        // The two elements can't be equal to each other.
                        areEqual = false;
                    }
                }
                else
                {
                    // The source[i] is equal to an element in the range [0..i].
                    // The two elements can only be equal to each other, if they are
                    // both equal to the same antecedent element.
                    areEqual = state[i] == state[j];
                }
                yield return (source[i], source[j], areEqual);
            }
        }
    }
    

    A simple int[] with size equal to the list is enough for keeping track of previous comparisons.

    The beneficial effect of the optimization depends on the percent of duplicates in the list. The highest the percent, the better. As an example, in a list with 100 elements of which only 50 are unique, the GetPairs emitted the expected number of 4,950 pairs doing only 1,767 comparisons (-64.3% discount). The AreEqual information of all 4,950 pairs was correct.

    Usage example:

    static void CompareInDirectory(string startingDirectory)
    {
        DirectoryInfo dinfo = new(startingDirectory);
        string[][] buckets = dinfo.EnumerateFiles("", SearchOption.AllDirectories)
            .Select(f => (f.FullName, f.Length))
            .GroupBy(e => e.Length)
            .OrderBy(g => g.Key)
            .Select(g => g.Select(file => file.FullName).ToArray())
            .ToArray();
    
        foreach (string[] bucket in buckets)
        {
            var pairs = GetPairs(bucket, CompareFiles);
            foreach (var (x, y, areEqual) in pairs)
            {
                if (areEqual) Console.WriteLine(x + " and " + y + " are the same");
            }
        }
    }