
Notepad++ regular expression: Search for long strings which could contain newlines


Given a file containing 100000+ log messages like:

2017-08-10T14:49:09: Debug: D-UNK-000-000: [Event Processor] connectorStatus:   Pending
2017-08-10T14:49:09: Debug: D-UNK-000-000: [Event Processor] context:   <DataItem type="System.Availability.StateData" time="2017-08-04T01:10:59.9525690+02:00"><ManagementGroupId>{05120214-5C27-A4EE-D32B-09CB2239421C}</ManagementGroupId><Property Name="Details" VariantType="8">There are 1 messages attached



03.08.2017 21:00:12

Title: Mail sync issue



User Impact: Users are unable to sync emails using Apple Mail on their Mac computers.

</Property></DataItem>
2017-08-10T14:49:09: Debug: D-UNK-000-000: [Event Processor] context_ManagementGroupId: {05120214-5C27-A4EE-D32B-09CB2239421C}
2017-08-10T14:49:09: Debug: D-UNK-000-000: [Event Processor] context:   null
2017-08-10T14:49:09: Debug: D-UNK-000-000: [Event Processor] context_HealthServiceId:   390382B5-C177-0529-DDC0-F2969F667E49

Every log message starts on a new line with a timestamp, but some messages extend over multiple lines: in the example above, the second message contains " context:" followed by arbitrary XML with embedded newlines. The example above therefore contains exactly 5 log messages.

I'm looking for log messages which are very long, say more than 15000 characters.

I can step through all relevant log messages using Notepad++ searching for this pattern (option ". matches newline" selected):

context:(.+?)2017-0\d-\d\dT\d\d:\d\d:\d\d:

But I could not extend it so that it matches only the long ones.

I expected the following to work, but it does not (it selects the whole file):

context:(.+?){15000,}2017-0\d-\d\dT\d\d:\d\d:\d\d:


If this is not possible with Notepad++, I am also willing to use other tools, including the command line on a Linux box.


Not necessary, but if it is easy to do:
Search for the same messages as described above and replace the whole XML string with its length (number of characters).


Solution

  • You may use

    (?s)context:(?:(?!2017-0\d-\d\dT\d\d:\d\d:\d\d:).){350,}
    

    Explanation:

    • (?s) - DOTALL mode ON (same as . matches newline enabled)
    • context: - a literal substring
    • (?:(?!2017-0\d-\d\dT\d\d:\d\d:\d\d:).){350,} - 350 or more occurrences ({350,}) of any char (.) that is not the first char of a sequence matching the 2017-0\d-\d\dT\d\d:\d\d:\d\d: subpattern.

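    The (?:(?!...).)* construct is a so-called tempered greedy token: the negative lookahead is checked before each char is consumed, so the match cannot run past the next timestamp.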

    Adjust the minimum of the limiting quantifier as you see fit, e.g. {15000,} for the 15000-character threshold in the question.
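
For the command-line route mentioned in the question, the same pattern can also be tried outside Notepad++. Below is a minimal sketch using Python's `re` module, assuming the timestamp format shown in the sample. The helper name `long_context_pattern`, the shortened sample lines, and the threshold of 50 are illustrative assumptions, not part of the original answer.

```python
import re

# Timestamp that opens every log message, e.g. "2017-08-10T14:49:09:".
TIMESTAMP = r"2017-0\d-\d\dT\d\d:\d\d:\d\d:"


def long_context_pattern(min_chars):
    """Build the tempered pattern with a configurable minimum length.

    Hypothetical helper: (?s) lets "." match newlines, and the negative
    lookahead stops the repetition right before the next timestamp.
    """
    return re.compile(r"(?s)context:(?:(?!%s).){%d,}" % (TIMESTAMP, min_chars))


# Shortened stand-in for the log in the question (not real data).
sample = (
    "2017-08-10T14:49:09: Debug: [Event Processor] connectorStatus:   Pending\n"
    "2017-08-10T14:49:09: Debug: [Event Processor] context:   <DataItem>\n"
    "Title: Mail sync issue\n"
    "\n"
    "User Impact: Users are unable to sync emails.\n"
    "</DataItem>\n"
    "2017-08-10T14:49:09: Debug: [Event Processor] context:   null\n"
    "2017-08-10T14:49:09: Debug: [Event Processor] context_HealthServiceId:   390382B5\n"
)

# Threshold of 50 is chosen only so the short sample produces a match.
matches = [m.group(0) for m in long_context_pattern(50).finditer(sample)]

for text in matches:
    print(len(text), repr(text[:30]))
```

On a real file one would read the whole log into a string first and pass the actual threshold, for example `long_context_pattern(15000)`. The short "context:   null" message is skipped because fewer than 50 characters follow it before the next timestamp.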

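
The optional part of the question, replacing the XML string with its length, can be approached with a replacement callback. This is one possible sketch in Python, again assuming the timestamp format from the sample; `replace_long_contexts` and the sample data are hypothetical, and the threshold counts every character between "context:" and the next timestamp, whitespace included.

```python
import re

# Timestamp that opens every log message, e.g. "2017-08-10T14:49:09:".
TIMESTAMP = r"2017-0\d-\d\dT\d\d:\d\d:\d\d:"


def replace_long_contexts(text, min_chars):
    """Replace each long "context:" payload with its character count.

    Hypothetical helper: whitespace around the payload is kept as is,
    so the line structure of the log does not change.
    """
    pattern = re.compile(
        r"(?s)(context:)((?:(?!%s).){%d,})" % (TIMESTAMP, min_chars)
    )

    def repl(match):
        payload = match.group(2)
        body = payload.strip()
        if not body:
            # Whitespace-only payload: leave the message untouched.
            return match.group(0)
        lead = payload[: len(payload) - len(payload.lstrip())]
        trail = payload[len(payload.rstrip()):]
        return match.group(1) + lead + str(len(body)) + trail

    return pattern.sub(repl, text)


# Shortened stand-in for the XML payload in the question (not real data).
xml = (
    "<DataItem>\n"
    "Title: Mail sync issue\n"
    "\n"
    "User Impact: Users are unable to sync emails.\n"
    "</DataItem>"
)
log = (
    "2017-08-10T14:49:09: Debug: [Event Processor] context:   " + xml + "\n"
    "2017-08-10T14:49:09: Debug: [Event Processor] context:   null\n"
)

# Threshold of 50 is chosen only so the short sample is affected.
result = replace_long_contexts(log, 50)
print(result)
```

Messages below the threshold, such as "context:   null", are left unchanged, so the output keeps one line per message for the replaced entries.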