I am trying to get the first instance of a string in the following source string.
Input string:
><text color="#FFFF00" creationdate="D:20180307100631+04'00'" flags="print,nozoom,norotate" date="D:20180307100652+04'00'" name="a60915a3-1c23-4f6d-b8d4-fbe0dd4890e9" icon="Comment" page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#1D477B;font-weight:normal;font-style:normal"
>
</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>
</span
I am trying to retrieve the output below:
page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span
which is everything up to the first instance of </span.
My RegExp is below; it picks the last occurrence of the desired end character group instead of the first:
page="[0-9]+".+subject="(Text Box|Sticky Note)".+((\s+.+)+);<\/span
I have limited knowledge of RegEx so please bear with me.
The snippet is XFDF output (a PDF comment export), but it was being formatted oddly, so I used HTML tagging to format it.
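In the following regex, the main change I made was to make each quantifier lazy (.+? instead of .+), so that the dot matches as few characters as possible and stops at the first occurrence of whatever follows it. A greedy quantifier grabs as much text as it can and backtracks only as far as needed, which is why your pattern ends at the last </span rather than the first.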
page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span
Note that for the above pattern to work, the regex must run in DOT ALL mode, meaning that the dot also matches newlines.
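To see the lazy and greedy behaviour side by side, here is a minimal sketch in Python (used only for illustration, since the question is about VBA). The sample text is a shortened, hypothetical stand-in for the XFDF snippet in the question, and re.DOTALL plays the role of DOT ALL mode.

```python
import re

# Shortened, hypothetical stand-in for the XFDF snippet in the question.
text = (
    'name="a609" icon="Comment" page="7" subject="Sticky Note" title="saddia"\n'
    '><span dir="ltr"\n'
    '>As agreed with WPO</span\n'
    '><span dir="ltr"\n'
    '>\n'
    '</span\n'
)

# Lazy quantifiers (.+?) versus greedy ones (.+); everything else is identical.
lazy = r'page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span'
greedy = r'page="[0-9]+".+subject="(?:Text Box|Sticky Note)".+<\/span'

# re.DOTALL makes "." match newline characters as well.
lazy_match = re.search(lazy, text, re.DOTALL)
greedy_match = re.search(greedy, text, re.DOTALL)

print(lazy_match.group(0))                     # stops at the first </span
print(greedy_match.group(0).count("</span"))   # greedy match spans both </span
print(re.search(lazy, text))                   # without DOTALL the dot cannot cross lines
```

The last call shows why the mode matters: on this sample, without DOT ALL the lazy pattern finds nothing at all, because the dot cannot get past the end of the first line.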
VBA has no formal DOT ALL mode, but we can simulate it by using [\s\S] in place of the dot:
page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span
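As a rough check that the [\s\S] form behaves like the dot in DOT ALL mode, the sketch below compares the two in Python (again only as an illustration, not VBA). The sample text is hypothetical and much shorter than the real XFDF export.

```python
import re

# Hypothetical, shortened sample with two closing </span tags.
text = (
    'page="7" rect="351.3,135.7" subject="Sticky Note" title="saddia"\n'
    '><p dir="ltr"\n'
    '>first note</span\n'
    '><span dir="ltr"\n'
    '>second</span\n'
)

# [\s\S] matches any character, including newlines, so no flag is needed.
portable = r'page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span'
# Same pattern written with the dot; this one relies on re.DOTALL.
dotall = r'page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span'

a = re.search(portable, text)
b = re.search(dotall, text, re.DOTALL)

print(a.group(0) == b.group(0))  # both stop at the first </span
```

Because [\s\S] is an ordinary character class, the same pattern text should carry over to engines that lack a DOT ALL switch, which is the reason for using it in VBA.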