This is a tough problem. I have around 1,000 images stored on my disk, and I want to find images that are similar to each other by comparing them in pairs, so I have to do around 1,000 * 999 / 2 = 499,500 comparisons (the property of "being similar" is not transitive, so no comparison can be skipped by inference). My problem is not related to how to compare the images, but to how to manage the memory of my machine efficiently during the comparisons. I have already implemented the comparison function:
static bool AreSimilar(ImageInfo x, ImageInfo y)
{
// Logic
}
...where ImageInfo is a class that holds the information for one image:
class ImageInfo : IDisposable
{
public string Path { get; init; }
public System.Drawing.Image Image { get; init; }
public void Dispose() => Image.Dispose();
}
Ideally I would like to load all 1,000 images in memory, and then do a nested loop and invoke the AreSimilar method for each pair, but the memory needed for loading all of them at once exceeds by far the available memory of my machine. The image files are quite large, and their size varies considerably (most of them have sizes between 5 and 50 MB). The available RAM is 2 GB, so I can't have more than ~80 images loaded at the same time (assuming an average size of ~25 MB). Loading an image from the disk is quite slow. It is actually a lot slower to load two images from the disk than to compare them and find whether they are similar.
My question is how can I implement a method that will have the responsibility of loading/unloading the images from the disk, and yielding them in pairs, while taking advantage of all the available memory, but without exceeding the memory limit. Here is the signature of the method that I am trying to implement:
static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
IReadOnlyList<TSource> source,
Func<TSource, long> sizeSelector,
Func<TSource, TItem> itemLoader,
long maxConcurrentSize) where TItem : IDisposable;
The TSource will be the path of the file, and the TItem will be an ImageInfo. I am intending to use it like this:
string[] paths = Directory.GetFiles(@"C:\Images", "*.jpg");
var pairs = GetPairs(paths,
path => new FileInfo(path).Length,
path => new ImageInfo() { Path = path, Image = Image.FromFile(path) },
2_000_000_000);
foreach (var (x, y) in pairs)
{
if (AreSimilar(x, y))
Console.WriteLine($"{x.Path} and {y.Path} are similar!");
}
I am currently out of ideas of how to implement this method. It looks like a serious undertaking. All I have right now is the simple version below, which loads the images in pairs and ignores the sizeSelector and maxConcurrentSize parameters:
static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
IReadOnlyList<TSource> source,
Func<TSource, long> sizeSelector,
Func<TSource, TItem> itemLoader,
long maxConcurrentSize) where TItem : IDisposable
{
for (int i = 0; i < source.Count; i++)
{
using var first = itemLoader(source[i]);
for (int j = i + 1; j < source.Count; j++)
{
using var second = itemLoader(source[j]);
yield return (first, second);
}
}
}
Obviously the performance is terrible, since each image is loaded ~500 times on average.
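A quick way to verify this figure is a throwaway simulation of my own, with a dummy MemoryStream standing in for a loaded image (the sizeSelector and maxConcurrentSize arguments are ignored by this naive version, so dummy values are passed):
var source = Enumerable.Range(1, 1_000).ToArray();
int loads = 0;
var naive = GetPairs(source, _ => 0L, _ => { loads++; return new MemoryStream(); }, 0L);
Console.WriteLine($"{naive.Count():#,0} pairs"); // 499,500 pairs
Console.WriteLine($"{loads:#,0} loads"); // 500,500 loads: 1,000 from the outer loop + 499,500 from the inner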
I managed to come up with a compact algorithm that solves this quite difficult problem. The algorithm maintains 5 pieces of information as state:

1. A list (loaded) with the items that are currently loaded.
2. The total size of the currently loaded items.
3. A queue (unloaded) with the indices of the items that are currently not loaded.
4. An array (pending) containing the number of pairs still pending for each item.
5. A BitArray storing whether any particular pair has been emitted.

The algorithm moves items between the two collections loaded and unloaded, loads and unloads items with respect to the maxConcurrentSize policy, yields as many pairs from the loaded list as possible, and continues until both collections are empty. At that point all possible pairs are expected to have been emitted, provided that the algorithm is correct.
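To make the bookkeeping concrete, here is a small worked example of my own (not part of the algorithm itself): with source.Count = 4, every item participates in 3 pairs, so the pending array starts as [3, 3, 3, 3], and the BitArray needs 4 * 3 / 2 = 6 slots, one per unordered pair. Each time a pair is emitted, the pending counters of both of its items are decremented, and an item whose counter reaches 0 can be unloaded permanently.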
// Requires: using System.Collections; using System.Diagnostics; using System.Linq;
public static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
IReadOnlyList<TSource> source,
Func<TSource, long> sizeSelector,
Func<TSource, TItem> itemLoader,
long maxConcurrentSize) where TItem : IDisposable
{
List<(int Index, TItem Item, long Size)> loaded = new();
try
{
long loadedSumSize = 0L;
Queue<int> unloaded = new(Enumerable.Range(0, source.Count));
int[] pending = new int[source.Count];
for (int i = 0; i < pending.Length; i++) pending[i] = source.Count - 1;
BitArray emittedPairs = new(checked(source.Count * (source.Count - 1) >> 1));
while (unloaded.Count > 0)
{
// Select the next item to load, in FIFO order.
int indexToLoad = unloaded.Dequeue();
long sizeToLoad = Math.Max(0L, sizeSelector(source[indexToLoad]));
// Unload already loaded items until there is enough space for the
// new item, in LIFO order.
while (loaded.Count > 1 && loadedSumSize + sizeToLoad > maxConcurrentSize)
{
var toUnload = loaded[loaded.Count - 1];
loaded.RemoveAt(loaded.Count - 1);
toUnload.Item.Dispose();
unloaded.Enqueue(toUnload.Index);
loadedSumSize -= toUnload.Size;
}
// Load the new item.
var itemToLoad = itemLoader(source[indexToLoad]);
loaded.Add((indexToLoad, itemToLoad, sizeToLoad));
checked { loadedSumSize += sizeToLoad; }
var justLoaded = loaded[loaded.Count - 1];
// Yield pairs constituted of the new item with all already loaded items.
for (int i = 0; i < loaded.Count - 1; i++)
{
var existing = loaded[i];
Debug.Assert(existing.Index != justLoaded.Index);
// Switch the order in case the pair is not in the correct order.
var (x, y) = existing.Index < justLoaded.Index ?
(existing, justLoaded) : (justLoaded, existing);
// Emit the pair if it's not already emitted.
int emittedIndex = (y.Index * (y.Index - 1) >> 1) + x.Index;
if (emittedPairs[emittedIndex]) continue;
yield return (x.Item, y.Item);
emittedPairs[emittedIndex] = true;
pending[x.Index]--;
pending[y.Index]--;
}
// Unload all items that are completed.
for (int i = loaded.Count - 1; i >= 0; i--)
{
var (index, item, size) = loaded[i];
if (pending[index] > 0) continue;
Debug.Assert(pending[index] == 0);
loaded.RemoveAt(i);
item.Dispose();
loadedSumSize -= size;
}
}
// If the algorithm is correct, these final assertions should never fail.
Debug.Assert(loaded.Count == 0);
Debug.Assert(loadedSumSize == 0L);
Debug.Assert(pending.All(n => n == 0));
Debug.Assert(emittedPairs.Cast<bool>().All(b => b));
}
finally
{
// Ensure that in case the enumerable is enumerated only partially,
// all currently loaded items will be disposed.
foreach (var entry in loaded) entry.Item.Dispose();
}
}
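As a side note regarding the emittedIndex formula: a pair (x, y) with x < y is mapped to the flat triangular index y * (y - 1) / 2 + x, which is also why the BitArray is sized source.Count * (source.Count - 1) / 2. Below is a minimal check of my own (not part of the algorithm itself) that this mapping produces distinct indices for a small N:
const int N = 5;
var seen = new HashSet<int>();
for (int y = 1; y < N; y++)
    for (int x = 0; x < y; x++)
        seen.Add((y * (y - 1) >> 1) + x); // same formula as in the algorithm above
Console.WriteLine(seen.Count == N * (N - 1) / 2); // True: all 10 indices are distinct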
I measured the efficiency of this implementation by writing a simulation of the original problem:

1. 1,000 files, each having a size between 5 and 50 MB.
2. maxConcurrentSize equal to 2 GB.

var source = Enumerable.Range(1, 1000).ToArray();
var random = new Random(0);
var sizes = source.ToDictionary(x => x,
_ => (long)random.Next(5_000_000, 50_000_000));
int loaderInvokedCount = 0;
var query = GetPairs(source, x => sizes[x], x =>
{
loaderInvokedCount++;
return new MemoryStream();
}, 2_000_000_000);
var emittedPairsCount = query.Count();
Console.WriteLine($"Emitted {emittedPairsCount:#,0} pairs");
Console.WriteLine($"Loader invoked {loaderInvokedCount:#,0} times");
Output:
Emitted 499,500 pairs
Loader invoked 7,682 times
The naive implementation presented as part of the question invokes the loader 500,500 times for the same input, which is about 65 times more than this implementation. Achieving a 98.5% reduction is quite a satisfactory outcome.
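One more property worth checking is the cleanup guarantee of the finally block: even if the enumeration is abandoned early, every item created by the itemLoader should end up disposed. A minimal sketch of my own, reusing the source and sizes of the simulation above, plus a hypothetical DisposeTracker helper:
int loads = 0;
var partial = GetPairs(source, x => sizes[x],
    x => { loads++; return new DisposeTracker(); }, 2_000_000_000);
foreach (var pair in partial) break; // abandon the enumeration immediately
Console.WriteLine(loads == DisposeTracker.DisposedCount); // True: nothing leaked

// Hypothetical helper for this check only; counts how many instances were disposed.
class DisposeTracker : IDisposable
{
    public static int DisposedCount;
    public void Dispose() => DisposedCount++;
}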