Avoid regex backtracking while parsing big files


I'm parsing some files that follow a pattern in order to produce a human-readable report. I use a regex to parse those files.

Example of file:

2012-05-10 08:00:00.155: BROADCAST - Body: <?xml version="1.0" encoding="UTF-8" standalone="yes"?><Data></Data>. MessageProperties [headers={X_Day=20120510}]
2012-05-10 08:00:00.155: BROADCAST - Body: <?xml version="1.0" encoding="UTF-8" standalone="yes"?><Data></Data>. MessageProperties [headers={X_Day=20120510}]
2012-05-10 08:00:00.155: REQUEST - Body: <?xml version="1.0" encoding="utf-8"?>
<Data xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <field1>field1.val</field1>
  <field2>field2.val</field2>
</Data>. MessageProperties [headers={X_Day=20120510}, correlationId=[51, 56, 100, 54, 48, 48, 97, 54, 51, 99, 102, 100, 52, 102, 97, 51, 98, 51, 57, 52, 52, 49, 49, 50, 54, 97, 56, 100, 49, 48, 53, 98], other=blabla]

I want to extract the time part, the xml part and the properties part of each record.

Regex

Currently I have this regex, which gives me what I want (I have no problem doing some later processing to extract the exact bits I need, if that helps with the speed of the regex):

((?:[0-9]{1,4}[-| |:|\.])+[0-9]{1,3}): .*Body: ((?:.|>\n|>\r|>\r\n)*\. MessageProperties )(\[.*\])

The files can be big (around 100 MB, with 2,000-10,000 matches each), so I want to optimize it a little. The current problem is all the backtracking caused by the .* before Body and the (?:.|>\n|>\r|>\r\n)* before MessageProperties (I need to include the line breaks explicitly for the third example record above).

Is there any way to optimize all this backtracking? I couldn't find a way.

I'm using regex101 to develop it and then I adapt it to .NET.


Solution

  • General Tips and Improvements

    Try to avoid single-character alternations, quantify the parts to the right rather than the parts to the left, and use character classes wherever possible. Unknown text between two delimiter strings is better matched with the unroll-the-loop technique (that is, do not use .* or .*? even when you are tempted to do so).
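To make the unroll-the-loop idea concrete, here is a minimal C# sketch that compares a lazy dot with its unrolled equivalent on a simplified version of the delimiters from the question. The class name `UnrollDemo`, the helper `Extract`, and the shortened patterns are assumptions made for illustration only; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnrollDemo
{
    // Lazy version: .*? grows one character at a time and retries the
    // rest of the pattern after every step. Singleline lets . match \n.
    public static readonly Regex Lazy = new Regex(
        @"Body: (.*?)\. MessageProperties",
        RegexOptions.Singleline);

    // Unrolled version: [^.]* consumes every run of non-dot characters in
    // one go; a dot is only consumed when it does not start the terminator.
    // A negated character class matches newlines, so no option is needed.
    public static readonly Regex Unrolled = new Regex(
        @"Body: ([^.]*(?:\.(?! MessageProperties)[^.]*)*)\. MessageProperties");

    // Returns the text captured by group 1, or an empty string if no match.
    public static string Extract(Regex regex, string input)
    {
        Match m = regex.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Both patterns should capture the same text; the difference is in how many steps the engine needs, since the unrolled form only stops to check the lookahead at each dot rather than at every character.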

    Your Solution

    You may use

    ^([0-9]{4}-[- :.0-9]*):\s+[^-]*\s+-\s+Body:\s+([^.]*(?:\.(?!\s+MessageProperties\s)[^.]*)*\.\s+MessageProperties\s+)(\[.*])
    

    See the regex demo

    Details

    • ^ - start of a line (requires the RegexOptions.Multiline option, or (?m) prepended to the pattern)
    • ([0-9]{4}-[- :.0-9]*) - Group 1:
      • [0-9]{4} - 4 digits
      • - - a hyphen
      • [- :.0-9]* - 0+ digits, ., :, - or space chars
    • :\s+[^-]*\s+-\s+ - :, 1+ whitespaces, 0+ chars other than -, 1+ whitespaces, -, 1+ whitespaces
    • Body: - a substring
    • \s+ - 1+ whitespaces
    • ([^.]*(?:\.(?!\s+MessageProperties\s)[^.]*)*\.\s+MessageProperties\s+) - Group 2:
      • [^.]*(?:\.(?!\s+MessageProperties\s)[^.]*)* - the unrolled (?s:.*?): any 0+ chars other than ., followed with 0+ sequences of a . that is not followed with 1+ whitespaces, MessageProperties and a whitespace, and then any 0+ chars other than .
      • \.\s+ - a . and 1+ whitespaces
      • MessageProperties - a substring
      • \s+ - 1+ whitespaces
    • (\[.*]) - Group 3: a [ followed with any 0+ chars other than a newline as many as possible, and then a ].
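As a rough sketch of how the pattern above might be used from C#, the following wraps it in a small helper that yields the three groups for every record. The class name `LogRecordParser`, the method `Parse`, the tuple field names, and the choice of RegexOptions.Compiled are assumptions for illustration; only the pattern itself and the need for RegexOptions.Multiline come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogRecordParser
{
    // The pattern from the answer. Multiline makes ^ match at each line start.
    // Compiled is an optional assumption: it trades startup time for faster
    // matching, which may pay off on large inputs.
    private static readonly Regex RecordPattern = new Regex(
        @"^([0-9]{4}-[- :.0-9]*):\s+[^-]*\s+-\s+Body:\s+([^.]*(?:\.(?!\s+MessageProperties\s)[^.]*)*\.\s+MessageProperties\s+)(\[.*])",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Lazily yields (time, body, properties) for every record found in text.
    public static IEnumerable<(string Time, string Body, string Properties)> Parse(string text)
    {
        foreach (Match m in RecordPattern.Matches(text))
        {
            yield return (m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
        }
    }
}
```

Note that Group 2 still ends with the ". MessageProperties " delimiter, as in the original regex, so a caller that wants only the XML would trim that suffix in a later step. A typical call would pass in the whole file content, for example the result of File.ReadAllText on the log file.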