c# regex large-files

In C#, how can I scan very large files with a regex, or is it possible to feed a regex one char at a time?


I am building a tool to scan for a regex pattern in many files of arbitrarily large sizes.

I am going to scan multiple files in parallel, so I want to avoid loading entire files into memory, as they can be arbitrarily large. Instead, I will chunk each file using memory-mapped files and view accessors and then scan each chunk.

My questions are:

  1. How can I ensure that no matches are missed when the matching text spans two chunks?

  2. It would help if the Regex could be fed one char at a time and fire an event when a match is found. I would probably still need a maximum allowed match size, but is this possible with Regex?


Solution

  • If you don't allow spanning lines, your buffer only needs to be one line long.

    If you do allow spanning lines, then your buffer needs to be at least as large as the largest look-back you allow plus the longest line you allow.

    In direct response to your question, you can't do "one character at a time": the Regex matching methods take the complete input (a string or a span of chars), so both the pattern and the text being searched need to exist in memory for the comparison to work.
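The buffering idea above can be sketched as a chunked scan that carries the tail of each chunk over into the next one, so a match spanning a chunk boundary is still seen whole. This is a minimal sketch, not a definitive implementation. The names `ChunkedRegexScanner`, `Scan`, `chunkSize` and `maxMatchLength` are hypothetical, and it reads from a `TextReader` (for example a `StreamReader`) rather than a memory-mapped view, so that character decoding is handled by the reader. It assumes every match you care about is at most `maxMatchLength` chars long. Longer matches, and patterns whose look-ahead or look-behind reaches outside the current window, can give results that differ from scanning the whole file at once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class ChunkedRegexScanner
{
    // Yields (absolute char offset, matched text) for each match found.
    // Assumption: no match of interest is longer than maxMatchLength chars.
    public static IEnumerable<(long Offset, string Value)> Scan(
        TextReader reader, Regex regex, int chunkSize, int maxMatchLength)
    {
        if (maxMatchLength < 1)
            throw new ArgumentOutOfRangeException(nameof(maxMatchLength));
        if (chunkSize < maxMatchLength)
            throw new ArgumentException("chunkSize must be >= maxMatchLength.");

        int overlap = maxMatchLength - 1;        // tail re-scanned with the next chunk
        char[] buffer = new char[chunkSize + overlap];
        int carried = 0;                         // chars kept from the previous window
        long windowStart = 0;                    // absolute offset of buffer[0]
        long resumeAt = 0;                       // absolute offset where scanning resumes

        while (true)
        {
            // ReadBlock keeps reading until chunkSize chars arrive or the input ends.
            int read = reader.ReadBlock(buffer, carried, chunkSize);
            int length = carried + read;
            bool last = read < chunkSize;
            string window = new string(buffer, 0, length);

            // Matches starting inside the tail are deferred to the next window,
            // where they are guaranteed to fit completely.
            int limit = last ? length : length - overlap;
            int startAt = (int)(resumeAt - windowStart);

            for (Match m = regex.Match(window, startAt); m.Success; m = m.NextMatch())
            {
                if (!last && m.Index >= limit) break;
                yield return (windowStart + m.Index, m.Value);
                resumeAt = windowStart + m.Index + m.Length;
            }

            if (last) yield break;

            // Move the tail to the front of the buffer and advance the window.
            Array.Copy(buffer, length - overlap, buffer, 0, overlap);
            carried = overlap;
            windowStart += length - overlap;
            if (resumeAt < windowStart) resumeAt = windowStart;
        }
    }
}
```

A possible usage, with a deliberately tiny chunk size so that both matches fall across chunk boundaries:

```csharp
var regex = new Regex("ab+c");
using var reader = new StringReader("xxabbcxxabc");
foreach (var (offset, value) in ChunkedRegexScanner.Scan(reader, regex, chunkSize: 4, maxMatchLength: 4))
    Console.WriteLine($"{offset}: {value}");
// 2: abbc
// 8: abc
```

Memory use per file is bounded by `chunkSize + maxMatchLength - 1` chars plus one window string, which keeps a parallel scan over many files predictable. Resuming each window from the end of the last reported match avoids reporting a match in the overlap region twice.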