c#, out-of-memory, large-data

Working with large data sets and memory limitations


I am working with some code for comparing large collections of objects and storing all matches.

Unsurprisingly, I have just encountered a System.OutOfMemoryException.

How can I go about solving this?

During the comparison, should I be writing to memory and then have something else write the results to disk/RDBMS, i.e. create a buffer?
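Something along the lines of the sketch below is what I have in mind; the string-based match format, file path and buffer size are just placeholders:

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Rough idea: keep matches in a bounded buffer and flush it to disk
    // whenever it fills up, so the full result set never sits in memory.
    public sealed class MatchBuffer : IDisposable
    {
        private readonly List<string> _buffer;
        private readonly string _path;
        private readonly int _capacity;

        public MatchBuffer(string path, int capacity = 10000)
        {
            _path = path;
            _capacity = capacity;
            _buffer = new List<string>(capacity);
        }

        public void Add(string matchLine)
        {
            _buffer.Add(matchLine);
            if (_buffer.Count >= _capacity)
                Flush();
        }

        public void Flush()
        {
            if (_buffer.Count == 0) return;
            File.AppendAllLines(_path, _buffer);   // append this chunk to disk
            _buffer.Clear();                       // release the memory
        }

        public void Dispose() => Flush();          // write the final partial chunk
    }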


Solution

  • In fact it does depend on your environment, particularly on your operating system and whether the process runs as x86 or x64: a 32-bit process is limited to roughly 2–4 GB of addressable memory no matter how much RAM the machine has. Check more details here: Memory in depth
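    As a quick sanity check (my addition, not part of the original answer), you can log the bitness of the process and the OS:

    using System;

    // A 32-bit process will hit OutOfMemoryException long before a 64-bit
    // machine runs out of physical RAM, so check where the ceiling is.
    Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
    Console.WriteLine($"64-bit OS:      {Environment.Is64BitOperatingSystem}");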

    1. You have an advanced scenario where streaming is what you need. The exact solution depends on where you are pulling the data from. When pulling it from a SQL database you can use streaming via SqlDataReader, which in this case is tightly coupled with async; sample code:

    using (SqlDataReader reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess))
    {
        if (await reader.ReadAsync())
        {
            if (!(await reader.IsDBNullAsync(0)))
            {
                using (var dataStream = reader.GetStream(0))
                {
                    // process the data as a stream
                }
            }
        }
    }
    

    This link reveals a bit more detail: Retrieving large data set. However, keep in mind that such an approach forces you to use async in the connection string as well as deal with async code, which is always additional complexity, especially when you want to cover it with specs/tests.
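    For context, here is a fuller sketch of how that reader might be wired up end to end; the connection string handling, table and column names are placeholders rather than anything from the original answer:

    using System.Data;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.Data.SqlClient;   // System.Data.SqlClient on classic .NET Framework

    public static async Task CopyLargeColumnToFileAsync(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT LargeBlobColumn FROM dbo.SomeTable WHERE Id = @id", connection))
        {
            command.Parameters.AddWithValue("@id", 1);
            await connection.OpenAsync();

            // SequentialAccess keeps the reader from buffering the whole row,
            // so the large column can be streamed instead of materialized.
            using (SqlDataReader reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess))
            {
                if (await reader.ReadAsync() && !await reader.IsDBNullAsync(0))
                {
                    using (Stream dataStream = reader.GetStream(0))
                    using (FileStream file = File.Create("blob.bin"))
                    {
                        await dataStream.CopyToAsync(file);
                    }
                }
            }
        }
    }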

    2. Yet another approach is batching, i.e. buffering the data up to some acceptable limit, exposing the batch to the consuming code, and then fetching the next batch until everything is loaded; sample code (ProcessBatch stands in for whatever consumes a batch):

    var batch = new List<object[]>(batchSize);
    while (reader.Read())
    {
        // copy the current row into the batch
        var row = new object[reader.FieldCount];
        reader.GetValues(row);
        batch.Add(row);

        if (batch.Count >= batchSize)
        {
            ProcessBatch(batch);   // hand the full batch to the consuming code
            batch.Clear();         // release it before reading more rows
        }
    }

    if (batch.Count > 0)
        ProcessBatch(batch);       // last, partially filled batch
    

    You can roughly calculate the batch size by estimating the size of one row of data (based on the table schema, msdn article), or just make it configurable and experiment to find the most suitable value. The main advantage of this approach is that it needs minimal changes to your code and the code itself remains synchronous. The disadvantage is that you either have to keep an active connection open, or open a new connection every time and keep track of which records you have already read and which you still need to fetch.
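    One way to do that bookkeeping (my assumption, not spelled out in the answer) is keyset paging: remember the highest key you have seen and fetch the next batch above it on a fresh connection. The table and column names below are hypothetical:

    using System.Collections.Generic;
    using Microsoft.Data.SqlClient;   // System.Data.SqlClient on classic .NET Framework

    // Fetch rows in batches, opening a new connection per batch and tracking
    // progress via the last Id read (assumes an indexed bigint Id column).
    public static IEnumerable<object[]> ReadInBatches(string connectionString, int batchSize)
    {
        long lastId = 0;
        while (true)
        {
            var batch = new List<object[]>(batchSize);
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                @"SELECT TOP (@batchSize) Id, Payload
                  FROM dbo.SomeTable
                  WHERE Id > @lastId
                  ORDER BY Id", connection))
            {
                command.Parameters.AddWithValue("@batchSize", batchSize);
                command.Parameters.AddWithValue("@lastId", lastId);
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        var row = new object[reader.FieldCount];
                        reader.GetValues(row);
                        batch.Add(row);
                        lastId = reader.GetInt64(0);   // remember where we stopped
                    }
                }
            }

            if (batch.Count == 0)
                yield break;                           // everything has been fetched

            foreach (var row in batch)
                yield return row;
        }
    }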

    Finally, both options force you to take care of some more advanced questions, such as what to do if only part of the data has been fetched when the connection is lost (you need some fail-over mechanism), the ability to cancel a long-running retrieval operation after a certain timeout, etc.
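    To illustrate the cancellation point (again my sketch, not part of the answer), a CancellationTokenSource with a timeout can be threaded through the async ADO.NET calls:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Data.SqlClient;

    public static async Task ReadWithTimeoutAsync(string connectionString)
    {
        // Cancel the whole retrieval if it runs longer than two minutes.
        using (var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2)))
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Id, Payload FROM dbo.SomeTable", connection))
        {
            await connection.OpenAsync(cts.Token);
            using (var reader = await command.ExecuteReaderAsync(cts.Token))
            {
                while (await reader.ReadAsync(cts.Token))
                {
                    // process each row; an OperationCanceledException surfaces
                    // here once the timeout elapses.
                }
            }
        }
    }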

    To conclude, if you do not want to handle the additional complexity that large data introduces, delegate the task to whatever is already available on the market, i.e. a database or a 3rd-party framework. If you feel your team has enough skills for this, then go ahead and implement it yourself: keep the comparison results in a disk file, utilize an in-memory cache, or just push the data into a database.
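    If you go the "push the data into a database" route, SqlBulkCopy is the usual tool for it; a rough sketch, with a placeholder table name:

    using System.Data;
    using Microsoft.Data.SqlClient;

    // Push comparison results into a staging table in bulk instead of keeping
    // the full result set in application memory.
    public static void SaveMatches(string connectionString, DataTable matches)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.ComparisonMatches";
            bulkCopy.BatchSize = 10000;           // send rows to the server in chunks
            bulkCopy.WriteToServer(matches);
        }
    }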