c# | performance | memory-management | string-matching | large-files

Searching numerous very large (50GB+) txt files for text matching


I am not sure how I should handle these large files. I have numerous 50GB+ files that I need to search through for matching text. Obviously I can't load an entire file into memory (or at least my computer cannot), so how would I go about loading and searching them?

I'm guessing I would load parts of a file into memory, search them, save my results, load the next part, and eventually move on to the next 50GB+ file while keeping track of my results, but I'm not sure exactly how to handle this. Any ideas? Are there specific functions I should be using for memory management and string handling?

I'd like to do it in C#. I'm doing this as a project for work, but I'd also like to learn as much as I can, so I would like to write the code myself as opposed to loading everything into a large database and searching there.

Speed is also a concern.


Solution

  • Assuming the files contain newlines, it's simple enough to use a stream with a good buffer size. FileStream and the like have an internal buffer, and the underlying mechanism reads from disk only as needed, so you can work through an entire file without running into the fundamental .NET array size limit or loading the whole file into memory.

    Note that any single allocation over 85 KB will end up on the Large Object Heap anyway, so you might want to be mindful of buffer and line sizes one way or another.

    static bool ContainsMatch(string path, string searchText)
    {
        // Large buffer + SequentialScan hints to the OS that we stream the file front to back.
        using var sr = new StreamReader(
            new FileStream(path,
                FileMode.Open,
                FileAccess.Read,
                FileShare.None,
                1024 * 1024, // some nasty buffer size that you have benchmarked for your system
                FileOptions.SequentialScan));

        // Only the buffer and the current line are ever held in memory.
        while (!sr.EndOfStream)
        {
            if (sr.ReadLine()!.Contains(searchText))
                return true;
        }

        return false;
    }
    

    Notes: The buffer size will be key to performance here; SSDs can take a larger buffer than the old spinning-platter HDDs. Determining the right size will require benchmarking.
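
    To make that benchmarking concrete, here is a minimal sketch of comparing a few candidate buffer sizes with Stopwatch. The file path, the search term, and the list of candidate sizes are assumptions for illustration, not recommendations; point it at one of your real files and adjust the sizes for your hardware.

    using System;
    using System.Diagnostics;
    using System.IO;

    class BufferBenchmark
    {
        static void Main()
        {
            // Hypothetical path - point this at one of your real 50GB+ files.
            const string path = "SomeFileName";

            // Candidate buffer sizes to compare; adjust for your disk and RAM.
            foreach (var bufferSize in new[] { 64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024 })
            {
                var sw = Stopwatch.StartNew();

                using var sr = new StreamReader(
                    new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                        bufferSize, FileOptions.SequentialScan));

                long matches = 0;
                string? line;
                while ((line = sr.ReadLine()) != null)
                {
                    // Ordinal comparison avoids culture-sensitive overhead on a hot loop.
                    if (line.Contains("bob", StringComparison.Ordinal))
                        matches++;
                }

                sw.Stop();
                Console.WriteLine($"buffer {bufferSize / 1024} KB: {matches} matches in {sw.Elapsed}");
            }
        }
    }

    Keep in mind that after the first pass the OS file cache will make later passes over the same file look faster than they really are, so either run each buffer size against a different file or discard the first run when comparing timings.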