c# .net regex replace file-handling

How can I find and replace text in a large file (150MB-250MB) with regular expressions in C#?


I am working with files that range from 150MB to 250MB, and I need to append a form feed (\f) character to each match found in a match collection. Currently, my regular expression for each match is this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f", RegexOptions.Singleline);

Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.

I see a ton of examples on Stack Overflow for replacing text, but not so many for larger files. Typical suggestions include:

  • Using StreamReader to read the entire file into a string, then doing a find and replace in that string.
  • Using MatchCollection in combination with File.ReadAllText().
  • Reading the file line by line and looking for matches there.

The problem with the first two is that they eat up a ton of memory, and I worry about the program being able to handle all of that. The problem with the third option is that my regular expression spans many lines, and thus will never match within a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.

What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?

Edit:

Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, append it to a buffer string, and check whether the buffer matched my expression. If it matched, I would write the match to the output file. If it didn't, I would keep appending lines to the buffer until it did. Like this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";

//For keeping track of trailing bits after our match.
int matchlength = 0;

using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
    //While we are still reading lines in the file...
    while ((line = sr.ReadLine()) != null)
    {
        //Keep adding lines to buildingmatch until we can match the regular expression.
        buildingmatch = buildingmatch + line + "\r\n";
        if (myreg.IsMatch(buildingmatch))
        {
            match = myreg.Match(buildingmatch).Value;
            matchlength = match.Length;

            //Keep whatever follows the match for the next iteration.
            if (matchlength < buildingmatch.Length)
            {
                whatremains = buildingmatch.Substring(matchlength, buildingmatch.Length - matchlength);
            }

            sw.Write(match + "\f\f");
            buildingmatch = whatremains;
            whatremains = "";
        }
    }
}

The problem is that this took about 55 minutes to process a roughly 150MB file. There HAS to be a better way to do this...
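
One likely contributor to that runtime is that the loop rebuilds the buffer string and rescans it with the regex after every single line, so the work grows roughly quadratically with the distance between matches. Since every match of this pattern ends in a form feed, a match can only be completed by a line that contains \f. The following is a minimal sketch of that idea, not a tested drop-in replacement. The names FormFeedAppender, Process and suffix are hypothetical, and like the loop above it assumes that normalizing line endings to "\r\n" is acceptable.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class FormFeedAppender
{
    // Same pattern as in the question; every match ends with a form feed.
    static readonly Regex Pattern = new Regex(
        "ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f",
        RegexOptions.Singleline);

    // Copies reader to writer and writes `suffix` after every match.
    // Assumption: line endings may be normalized to "\r\n".
    public static void Process(TextReader reader, TextWriter writer, string suffix)
    {
        var buffer = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            buffer.Append(line).Append("\r\n");

            // A match can only be completed by a line containing '\f',
            // so skip the regex scan for every other line.
            if (line.IndexOf('\f') < 0) continue;

            string text = buffer.ToString();
            int consumed = 0;
            for (Match m = Pattern.Match(text); m.Success; m = m.NextMatch())
            {
                int end = m.Index + m.Length;
                // Write everything up to the end of the match, then the suffix.
                writer.Write(text.Substring(consumed, end - consumed));
                writer.Write(suffix);
                consumed = end;
            }
            // Keep only the text that has not been written yet.
            buffer.Remove(0, consumed);
        }
        // Flush trailing text that never became part of a match.
        writer.Write(buffer.ToString());
    }
}
```

It could be called with a StreamReader over srcFile and a StreamWriter over destFile, passing "\f" (or "\f\f") as the suffix. If matches are rare the buffer can still grow large, so this is a trade-off rather than a guaranteed fix.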


Solution

  • If you can load the whole file into a single string variable, there is no need to first collect matches and then append text to them in a loop. You can use a single Regex.Replace operation:

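    // Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
    // myregex is the Regex from the question (named myreg there).
    string text = File.ReadAllText(srcFile);
    // Overwrite destFile as UTF-8 with a 5 MB (5242880-byte) write buffer.
    using (StreamWriter sw = new StreamWriter(destFile, false, Encoding.UTF8, 5242880))
    {
        // "$&" re-inserts each match; "\f\f" is appended right after it.
        sw.Write(myregex.Replace(text, "$&\f\f"));
    }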
    

    Details:

    • string text = File.ReadAllText(srcFile); - reads the contents of srcFile into the text variable (naming it match, as in the question, would be confusing)
    • myregex.Replace(text, "$&\f\f") - replaces every match of myregex with itself ($& is the substitution for the whole match value) followed by two \f chars.
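
To see the $& substitution in isolation, here is a small sketch on an in-memory string. The class and method names (WholeMatchDemo, AppendToMatches) and the shortened pattern in the usage note are made up for illustration. Note that a suffix containing a $ character would itself be interpreted as a substitution, which this sketch does not guard against.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeMatchDemo
{
    // Returns `input` with `suffix` inserted after every match of `pattern`.
    public static string AppendToMatches(string input, string pattern, string suffix)
    {
        // "$&" in a .NET replacement pattern stands for the entire matched text,
        // so each match is replaced by itself followed by the suffix.
        return Regex.Replace(input, pattern, "$&" + suffix, RegexOptions.Singleline);
    }
}
```

For example, AppendToMatches("ABC: x\fother\fABC: y\f", "ABC: (.*?)\f", "\f") returns "ABC: x\f\fother\fABC: y\f\f": each of the two matches gains one extra form feed, and the text between them is left untouched.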