Tags: regex, notepad++, regex-group, regex-greedy, find-replace

How to remove spaces from a captured wildcard?


I'm trying to alter some XML with Find&Replace in Notepad++ using regex.

This is the specific XML I'm trying to capture:

<category name="Content Server Categories:FOLDER:test category">
    <attribute name="test attribuut"><![CDATA[test]]></attribute>
    <attribute name="test attribuut1"><![CDATA[test1]]></attribute>
</category>

The following 'FIND' regex does the job (for now):

<(category) name="Content Server Categories:(.+?)">(.+)</(category)>

Now I need the XML to be replaced by this:

<category-FOLDER:testcategory name="Content Server Categories:FOLDER:test category">
    <attribute name="test attribuut"><![CDATA[test]]></attribute>
    <attribute name="test attribuut1"><![CDATA[test1]]></attribute>
</category-FOLDER:testcategory>

So far I have tried this 'REPLACE BY' expression:

<($1-$2) name="Content Server Categories:($2)">($3)</($1-$2)>

But that gives the following output:

<category-FOLDER:test category name="Content Server Categories:FOLDER:test category">
    <attribute name="test attribuut"><![CDATA[test]]></attribute>
    <attribute name="test attribuut1"><![CDATA[test1]]></attribute>
</category-FOLDER:test category>

As you can see, I get category-FOLDER:test category instead of category-FOLDER:testcategory.

The space(s) need to be removed.

The problem is that the input can look different. Now it is this:

<category name="Content Server Categories:FOLDER:test category">

But it could look like these examples as well:

<category name="Content Server Categories:FOLDER1:FOLDER2:test category">

<category name="Content Server Categories:FOLDER NAME:test category">

<category name="Content Server Categories:FOLDER NAME: FOLDER NAME1:test category">

<category name="Content Server Categories:FOLDER:test category name">

...

How do I catch all of these correctly and remove the spaces?

EDIT: Almost forgot,

'. matches newline' is ON

Solution

  • One approach is to do it in two steps, because the spaces can only be removed after the new tag name has been built.

    First get the required structure (note the non-greedy .+? to prevent overmatching):

    <(category) name="Content Server Categories:(.+?)">(.+?)</(category)>
    


    In the replacement, use your expression without the parentheses; otherwise they are inserted literally into the output:

    <$1-$2 name="Content Server Categories:$2">$3</$1-$2>
    

    Then match the spaces in the new tag name one at a time, using \G to chain consecutive matches:

    (?:</?category-|\G(?!^))\K\s*([\w:]+) (?!name=)
    

    In the replacement, use only capturing group 1, $1. The whitespace around it is matched but not captured, so it is dropped.

    Explanation

    • (?: Non-capturing group
      • </?category- Match <category- or </category- (the / is optional)
      • | Or
      • \G(?!^) Assert the position at the end of the previous match; (?!^) excludes the start position, where \G would otherwise also match
    • ) Close the non-capturing group
    • \K\s* Forget what was matched so far, then match 0+ whitespace chars
    • ([\w:]+) Capture in group 1 one or more word chars or :
    • A literal space, matched but not captured
    • (?!name=) Assert that what is directly to the right is not 'name='

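If the same rewrite has to be scripted outside Notepad++, the two steps can be folded into one pass. The sketch below is a different technique from the \G approach above, not a port of it: Python's built-in re module supports neither \G nor \K, so a replacement callback strips the whitespace instead. It assumes Python 3, and the names PATTERN, rename_category and rewrite are made up for illustration. re.DOTALL stands in for the '. matches newline' option.

```python
import re

# Step-1 pattern from the answer. re.DOTALL plays the role of
# Notepad++'s ". matches newline" option.
PATTERN = re.compile(
    r'<(category) name="Content Server Categories:(.+?)">(.+?)</(category)>',
    re.DOTALL,
)


def rename_category(match):
    """Build the replacement for one <category> element (hypothetical helper)."""
    tag, path, body = match.group(1), match.group(2), match.group(3)
    # Remove every whitespace run from the path, but only for the tag name;
    # the name="..." attribute keeps the original path untouched.
    suffix = re.sub(r"\s+", "", path)
    return (
        f'<{tag}-{suffix} name="Content Server Categories:{path}">'
        f"{body}</{tag}-{suffix}>"
    )


def rewrite(xml):
    """Apply the rename to every matching <category> element in xml."""
    return PATTERN.sub(rename_category, xml)


sample = '''<category name="Content Server Categories:FOLDER NAME: FOLDER NAME1:test category">
    <attribute name="test attribuut"><![CDATA[test]]></attribute>
</category>'''

print(rewrite(sample))
```

Because the callback sees the captured path as a plain string, any number of folders or spaces is handled without chaining matches. Like the original pattern, this sketch treats the XML as text and will not cope with nested category elements.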