I want to extract a substring from the string below.
SCENARIO:
As soon as wafers_starts matches in the string, the selection should begin at the closest preceding <ac:structured-macro. (There are two of them before wafers_starts in the example below; I only need the one immediately before it.)
The substring should extend through wafers_ends, up to and including the first end tag </ac:structured-macro> that follows it.
EXAMPLE CODE:
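# Clear any previous match results; Remove-Variable takes the variable name, without the $ sigil
if ($null -ne $Matches) { Remove-Variable -Name Matches }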
$confluenceHtml = "<h2>Description</h2><ac:structured-macro ac:macro-id=""77f3n751-39w7-4746-acd4-bee7586449ed"" ac:name=""warning"" ac:schema-version=""1""><ac:parameter ac:name=""title"">Compatibility</ac:parameter><ac:rich-text-body><p class=""auto-cursor-target""><br/></p><table class=""wrapped""><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout </p></td></tr></tbody></table><p class=""auto-cursor-target""><br/></p><p class=""auto-cursor-target""><br/></p><ac:structured-macro ac:macro-id=""4657sd53-e024-4ea3-a5e2-4586667542da"" ac:name=""excerpt"" ac:schema-version=""1""><ac:parameter ac:name=""hidden"">true</ac:parameter><ac:parameter ac:name=""atlassian-macro-output-type"">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id=""77f5e121-31d7-4576-awq4-bej57t6d39ed"" ac:name=""warning"" ac:schema-version=""1""><ac:parameter ac:name=""title"">Compatibility</ac:parameter> <ac:rich-text-body><p class=""auto-cursor-target""><br/></p><table class=""wrapped""><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class=""auto-cursor-target""><br/></p><ac:structured-macro ac:macro-id=""72d7h552-a5dd-44cc-a4re-6f3247574fbd"" ac:name=""excerpt"" ac:schema-version=""1""><ac:parameter ac:name=""hidden"">true</ac:parameter><ac:parameter ac:name=""atlassian-macro-output-type"">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro><p class=""auto-cursor-target""><br/></p></ac:structured-macro>"
if ($confluenceHtml -match '\<ac:structured-macro.+?wafers_starts([\s\S]*)wafers_ends.+?\<\/ac:structured-macro\>') {
    $matches[0]
}
OUTPUT:
<ac:structured-macro ac:macro-id="77f3n751-39w7-4746-acd4-bee7586449ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter><ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout </p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="4657sd53-e024-4ea3-a5e2-4586667542da" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id="77f5e121-31d7-4576-awq4-bej57t6d39ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter> <ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="72d7h552-a5dd-44cc-a4re-6f3247574fbd" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro>
PROBLEM:
The end of the substring is correct. However, even after several attempts, I was not able to get the beginning right: the regex starts the match at the first occurrence of <ac:structured-macro in the string.
DESIRED OUTPUT:
I only want the substring below, which begins at the single <ac:structured-macro that sits right before the first wafers_starts:
<ac:structured-macro ac:macro-id="4657sd53-e024-4ea3-a5e2-4586667542da" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id="77f5e121-31d7-4576-awq4-bej57t6d39ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter> <ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="72d7h552-a5dd-44cc-a4re-6f3247574fbd" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro>
QUESTION:
I am looking for a corrected, working regex pattern.
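Use the regex below. It relies on a tempered greedy token, (?:(?!ac:structured-macro).)+, which consumes one character at a time only while ac:structured-macro does not start at the current position. As a result, no second ac:structured-macro can appear between the opening <ac:structured-macro and wafers_starts, so the match has to begin at the tag immediately before wafers_starts.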
<ac:structured-macro(?:(?!ac:structured-macro).)+wafers_starts([\s\S]*)wafers_ends.+?<\/ac:structured-macro>
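To see the effect in PowerShell, here is a minimal sketch. The $sample string is a shortened, hypothetical stand-in for $confluenceHtml (it is not the original page), and $pattern and $result are names chosen only for this illustration.

```powershell
# Shortened, hypothetical stand-in for $confluenceHtml: an outer "warning"
# macro that contains two nested "excerpt" macros with the marker texts.
$sample = '<ac:structured-macro ac:name="warning"><p>intro</p>' +
          '<ac:structured-macro ac:name="excerpt"><p>wafers_starts</p></ac:structured-macro>' +
          '<h2>Deployment Notes</h2>' +
          '<ac:structured-macro ac:name="excerpt"><p>wafers_ends</p></ac:structured-macro>' +
          '<p>tail</p></ac:structured-macro>'

# Same pattern as above; the tempered greedy token stops the engine from
# crossing another "ac:structured-macro" before it reaches wafers_starts.
$pattern = '<ac:structured-macro(?:(?!ac:structured-macro).)+wafers_starts([\s\S]*)wafers_ends.+?</ac:structured-macro>'

if ($sample -match $pattern) {
    $result = $Matches[0]   # the full match; $Matches[1] holds the text between the two markers
    $result
}
```

With this sample, the match should start at the excerpt macro that wraps wafers_starts rather than at the outer warning macro, and stop at the first </ac:structured-macro> after wafers_ends. Note that . does not match line breaks by default, so if the real page source spans multiple lines, prefixing the pattern with (?s) is one option to consider.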