Tags: regex, powershell, powershell-3.0

How to match a substring using regex and get some extra strings before and after the match


I want to extract a substring from the string below.

SCENARIO:

  1. The substring should start at the <ac:structured-macro that immediately precedes wafers_starts. (Two of them come before wafers_starts in the example below; I only need the closest one.)

  2. The substring should run through wafers_ends and stop at the first closing tag </ac:structured-macro> after it.

EXAMPLE CODE:

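# Clear any result left by an earlier -match; Remove-Variable takes the variable name without the $ sign
if ($null -ne $Matches) { Remove-Variable -Name Matches }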

$confluenceHtml = "<h2>Description</h2><ac:structured-macro ac:macro-id=""77f3n751-39w7-4746-acd4-bee7586449ed"" ac:name=""warning"" ac:schema-version=""1""><ac:parameter ac:name=""title"">Compatibility</ac:parameter><ac:rich-text-body><p class=""auto-cursor-target""><br/></p><table class=""wrapped""><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout </p></td></tr></tbody></table><p class=""auto-cursor-target""><br/></p><p class=""auto-cursor-target""><br/></p><ac:structured-macro ac:macro-id=""4657sd53-e024-4ea3-a5e2-4586667542da"" ac:name=""excerpt"" ac:schema-version=""1""><ac:parameter ac:name=""hidden"">true</ac:parameter><ac:parameter ac:name=""atlassian-macro-output-type"">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id=""77f5e121-31d7-4576-awq4-bej57t6d39ed"" ac:name=""warning"" ac:schema-version=""1""><ac:parameter ac:name=""title"">Compatibility</ac:parameter>  <ac:rich-text-body><p class=""auto-cursor-target""><br/></p><table class=""wrapped""><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class=""auto-cursor-target""><br/></p><ac:structured-macro ac:macro-id=""72d7h552-a5dd-44cc-a4re-6f3247574fbd"" ac:name=""excerpt"" ac:schema-version=""1""><ac:parameter ac:name=""hidden"">true</ac:parameter><ac:parameter ac:name=""atlassian-macro-output-type"">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro><p class=""auto-cursor-target""><br/></p></ac:structured-macro>"

if ($confluenceHtml -match '\<ac:structured-macro.+?wafers_starts([\s\S]*)wafers_ends.+?\<\/ac:structured-macro\>') {  
    $matches[0]
}

OUTPUT:

<ac:structured-macro ac:macro-id="77f3n751-39w7-4746-acd4-bee7586449ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter><ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout </p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="4657sd53-e024-4ea3-a5e2-4586667542da" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id="77f5e121-31d7-4576-awq4-bej57t6d39ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter>  <ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="72d7h552-a5dd-44cc-a4re-6f3247574fbd" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro>

PROBLEM:

The end of the substring is correct. However, after several attempts I still cannot get the beginning right: the match starts at the first occurrence of <ac:structured-macro instead of the one just before wafers_starts.

DESIRED OUTPUT:

I only want the substring below, which begins at the <ac:structured-macro immediately before the first wafers_starts:

<ac:structured-macro ac:macro-id="4657sd53-e024-4ea3-a5e2-4586667542da" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_starts</p></ac:rich-text-body></ac:structured-macro><h2>Deployment Notes</h2><ac:structured-macro ac:macro-id="77f5e121-31d7-4576-awq4-bej57t6d39ed" ac:name="warning" ac:schema-version="1"><ac:parameter ac:name="title">Compatibility</ac:parameter>  <ac:rich-text-body><p class="auto-cursor-target"><br/></p><table class="wrapped"><colgroup> <col/> <col/> <col/> </colgroup><tbody><tr><th><p>Prerequisite</p><p>This needs a progressive rollout 2,3,4,5 and so on</p></td></tr></tbody></table><p class="auto-cursor-target"><br/></p><ac:structured-macro ac:macro-id="72d7h552-a5dd-44cc-a4re-6f3247574fbd" ac:name="excerpt" ac:schema-version="1"><ac:parameter ac:name="hidden">true</ac:parameter><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>wafers_ends</p></ac:rich-text-body></ac:structured-macro>

QUESTION:

I am looking for a corrected, working regex pattern.


Solution

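  • Use a tempered greedy token, (?:(?!ac:structured-macro).)+, right after the opening <ac:structured-macro. It consumes one character at a time, but only while ac:structured-macro does not start at the current position, so the match cannot run across another macro tag on its way to wafers_starts. An attempt that starts at the first (outer) tag therefore fails, and the match begins at the tag closest to wafers_starts: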

    <ac:structured-macro(?:(?!ac:structured-macro).)+wafers_starts([\s\S]*)wafers_ends.+?<\/ac:structured-macro>
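
    As a rough illustration, the pattern can be tried from PowerShell as sketched below. The $sample string is a shortened, hypothetical stand-in for $confluenceHtml (same nesting of tags, far less markup), and the variable names $sample, $pattern and $result are assumptions made for this sketch only.

```powershell
# Shortened, hypothetical stand-in for $confluenceHtml: an outer "warning"
# macro that contains two "excerpt" macros and some trailing markup.
$sample = '<h2>A</h2><ac:structured-macro ac:name="warning"><p>intro</p>' +
          '<ac:structured-macro ac:name="excerpt"><p>wafers_starts</p></ac:structured-macro>' +
          '<h2>Notes</h2>' +
          '<ac:structured-macro ac:name="excerpt"><p>wafers_ends</p></ac:structured-macro>' +
          '<p>tail</p></ac:structured-macro>'

# The pattern from the answer; single quotes keep PowerShell from
# interpreting anything inside the regex.
$pattern = '<ac:structured-macro(?:(?!ac:structured-macro).)+wafers_starts([\s\S]*)wafers_ends.+?<\/ac:structured-macro>'

# -match fills the automatic $Matches table; index 0 holds the whole match.
$result = if ($sample -match $pattern) { $Matches[0] } else { $null }
$result
```

    With this sample, the match should begin at the first "excerpt" macro rather than at the outer "warning" macro, and end at the first closing tag after wafers_ends. Note that . does not match a newline by default, so input that spans several lines would probably need the (?s) inline option or [\s\S] in place of the dots.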
    
