regex vba xfdf

RegEx: capture a multi-line string up to the first instance of a set of characters


I am trying to match only up to the first occurrence of a string in the following source string.

Input string

 ><text color="#FFFF00" creationdate="D:20180307100631+04'00'" flags="print,nozoom,norotate" date="D:20180307100652+04'00'" name="a60915a3-1c23-4f6d-b8d4-fbe0dd4890e9" icon="Comment" page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#1D477B;font-weight:normal;font-style:normal"
>&#xD;</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>&#xD;</span

I am trying to retrieve the output below:

page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span

which runs up to the first instance of </span.

My RegExp is below; it matches up to the last occurrence of the desired end character group instead of the first:

page="[0-9]+".+subject="(Text Box|Sticky Note)".+((\s+.+)+);<\/span

I have limited knowledge of RegEx so please bear with me.

The snippet is XFDF output (a PDF comment export). It was being formatted weirdly, so I used HTML tagging to format it.


Solution

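  • In the following regex, the main change I made was to make the dot lazy (.+? instead of .+), meaning that it stops at the first place where the rest of the pattern can match. This prevents the greedy dot from running through the entire text and ending the match at the last </span. I also made the alternation group non-capturing with (?:...), since its contents are not needed separately.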

    page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span
    


    Note carefully that for the above pattern to work, the regex must run in DOT ALL mode, meaning that the dot also matches newline characters.

    In VBA, which doesn't have a formal DOT ALL mode, we can simulate it using [\s\S]:

    page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span
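
The behaviour described above can be checked with a small sketch. Python is used here purely for illustration, since only the pattern strings carry over to VBA; the SAMPLE text is a shortened, hypothetical stand-in for the XFDF in the question, and the first_match helper is an assumed name, not part of any library.

```python
import re

# Shortened, hypothetical stand-in for the XFDF snippet in the question.
SAMPLE = (
    'icon="Comment" page="7" rect="351.3,135.7,371.3,153.7" '
    'subject="Sticky Note" title="saddia"\n'
    '><contents-richtext\n'
    '><span dir="ltr"\n'
    '>As agreed with WPO that any unspecific area use GEN</span\n'
    '><span dir="ltr"\n'
    '>&#xD;</span\n'
)

# Lazy dot: only works when "." is allowed to match newlines (DOTALL).
LAZY_DOTALL = r'page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span'

# Same pattern with [\s\S] in place of ".", so no flag is required.
LAZY_PORTABLE = r'page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span'

# Greedy variant, kept only to show why the match ran to the last </span.
GREEDY = r'page="[0-9]+"[\s\S]+subject="(?:Text Box|Sticky Note)"[\s\S]+<\/span'


def first_match(pattern, text, flags=0):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None


lazy_dotall = first_match(LAZY_DOTALL, SAMPLE, re.DOTALL)
lazy_portable = first_match(LAZY_PORTABLE, SAMPLE)
greedy = first_match(GREEDY, SAMPLE)
no_flag = first_match(LAZY_DOTALL, SAMPLE)  # "." cannot cross the newlines

print(lazy_portable)
```

In this sketch both lazy patterns stop at the first </span, the greedy one continues to the last, and the dot-based pattern without DOTALL finds nothing because the match has to span several lines.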