Search code examples
regexsearch.net-2.0whitespaceperformance

C# regex performance with whitespace is too slow


I'm using WinForms NET 2.0 in C#.

I have text files, of about 1000-1500 lines. Certain lines in them begin with 4 or more letter words and I have to add a colon to these words. Having whitespace at the beginning of these lines is optional, and the line can contain more text apart from those words. Here's an example:

    lda $00,x
    mov $20
    rep #$20
    tax
    lda #$0000,y
word
    ...         ; comment
  anotherword           ; this word has whitespace before it.

Also, if there already is a colon, it simply ignores them to prevent more from being added. Here is my code:

Regex R = new Regex(@"^\s*(?<word>[A-Za-z0-9_]{4,})", RegexOptions.Multiline); //keep the words stored in a group called word
MatchCollection M = R.Matches(txt); //let my text file string be "txt"

foreach (Match m in M)
{
    string mm = m.Groups["word"].Value;
    if (!Regex.IsMatch(txt, @"^\s*\b" + mm + @"\b:", RegexOptions.Multiline)) // if already a colon, return
        txt = Regex.Replace(txt, @"^\s*\b" + mm + @"\b", mm + ":", RegexOptions.Multiline);
}

It works and all, but the problem? It's far too slow. I do other operations in the text file, but I've confirmed they are quick and the problem lies in the two "\s*"'s in my regex above. When I remove them both, the search becomes 10x faster.

How could I fix this?


Solution

  • Alternate solution to @TimPietzcker's:

    result = Regex.Replace(subject, @"^(?>(\s*\w{4,}))(?!:)", "$1:", RegexOptions.Multiline);
    

    where (?>...) is an atomic grouping. When the regex engine enters an atomic grouping, it is not allowed to backtrack anywhere in the input this grouping has consumed.

    Now, why is this benefical? Consider a line:

                 ab3 #13 spaces, then a, b, 3
    

    If you do not use an atomic grouping, when the regex fails to match the 4th character in the second quantifier, it must backtrack to the character before a: but it is a space, it doesn't match. And so on until it reaches the character before the beginning of a line, where ^ doesn't match, only then declaring a failure (\s* can match an empty string).

    With an atomic grouping, the engine will not backtrack this way, which is a huge gain of performance, especially when you deal with large data.