
Remove xml comments based on xml tags inside the comments with Powershell


I want to remove comments in xml files based on the xml tags inside the comment with Powershell.
Constraints:

  • Multi line comments should be supported
  • Keep xml formatting (e.g. do not write everything into a single line or remove indents)
  • Keep file encoding

My function UncommentXmlNode should remove the <!-- ... --> delimiters and keep the content captured as InnerXml. My function UncommentMyTwoNodes should uncomment two different xml tags. There are two tests:

  1. it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml" passes
  2. it "uncomments both if both are in same comment" fails unless you append (`n)?.* to the first pattern, and in that case test 1 breaks.

The tests are fairly easy to understand if you look at [xml]$expected and the two respective [xml]$inputXml values. The code below is a complete Pester test suite that reproduces my issue. You might have to create C:\temp or install Pester v5.

Import-Module Pester

Describe "Remove comments"{
    BeforeAll {
      function UncommentXmlNode {
        param (
            [String] $filePath,
            [String] $innerXmlToUncomment
        )
        $content = Get-Content $filePath -Raw
        $content -replace "<!--(?<InnerXml>$innerXmlToUncomment)-->", '${InnerXml}' | Set-Content -Path $filePath -Encoding utf8
    }

    function UncommentMyTwoNodes {
        param (
          [xml]$inputXml,
          [string]$inputXmlPath
        )    
        UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<myFirstOutcommentedXml.*" #Add this to make second test work (`n)?.*
        UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<mySecondOutcommentedXml.*"
    }

[xml]$expected = @"
<myXml>
  <!-- comment I want to keep -->
  <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
  }
    it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml"{
          [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
  <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@

      $tempPath = "C:\temp\test.xml"
      $inputXml.Save($tempPath)
      UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
      [xml]$result = Get-Content $tempPath
      $result.OuterXml | Should -be $expected.OuterXml
    }
  
    it "uncomments both if both are in same comment"{
        [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
      $tempPath = "C:\temp\test.xml"
      $inputXml.Save($tempPath)
      UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
      [xml]$result = Get-Content $tempPath
      $result.OuterXml | Should -be $expected.OuterXml
    }
  }

Solution

  • I made some changes to your code to make it easier to test:

    • first of all just working with plain strings without converting to [xml] and calling .OuterXml
    • second, just working with plain strings and not reading / writing to disk
    • I've also removed all the Pester testing code for the sake of clarity

    So, here's some test data to work with:

    $expected = @"
    <myXml>
      <!-- comment I want to keep -->
      <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
      <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
      <myOtherXmlTag attributeC="value" />
      <!-- comment I want to keep -->
    </myXml>
    "@
    
    # two tags inside separate xml comments
    $inputXml1 = @"
    <myXml>
      <!-- comment I want to keep -->
      <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
      <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
      <myOtherXmlTag attributeC="value" />
      <!-- comment I want to keep -->
    </myXml>
    "@
    
    # two tags inside a single xml comment
    $inputXml2 = @"
    <myXml>
      <!-- comment I want to keep -->
      <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
      <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
      <myOtherXmlTag attributeC="value" />
      <!-- comment I want to keep -->
    </myXml>
    "@
    

    Here are the updated functions:

    function UncommentXmlNode
    {
        param
        (
            [string] $xml,
            [string] $uncomment
        )
        return $xml -replace "(?s)<!--(?<InnerXml><$uncomment.*?)-->", '${InnerXml}'
        #                     ^^^^                           ^^^
        #                     single-line (eats `n)          lazy / non-greedy
    }
    
    function UncommentMyTwoNodes
    {
        param (
          [string] $xml
        )    
        $xml = UncommentXmlNode -xml $xml -uncomment "myFirstOutcommentedXml"
        $xml = UncommentXmlNode -xml $xml -uncomment "mySecondOutcommentedXml"
        return $xml
    }
    

    And here's some example usage:

    (UncommentMyTwoNodes -xml $inputXml1) -eq $expected
    # True
    
    (UncommentMyTwoNodes -xml $inputXml2) -eq $expected
    # True
    

    The differences are:

    • enabling the single-line option in the regex - (?s) - "so that it matches every character, instead of matching every character except for the newline character \n"

    • turning the greedy .* into a lazy .*? by appending ? to the quantifier. This is needed because otherwise (?s) above lets .* run across newlines, so your --> matches the last instance in the input string. Making it lazy matches the first --> after the opening <!-- instead.
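
    To make the greedy / lazy difference concrete, here is a small sketch on a made-up input (the $sample string and the <a />, <b />, <keep /> tags are just placeholders, not part of your files):

```powershell
# Two separate comments, with an element between them
$sample = "<!--<a />-->`n<keep />`n<!--<b />-->"

# Greedy: with (?s), .* runs across newlines and backtracks to the LAST -->
$greedy = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*)-->').Groups['InnerXml'].Value

# Lazy: .*? stops at the FIRST --> after the opening <!--
$lazy = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*?)-->').Groups['InnerXml'].Value

$greedy   # captures everything up to the final -->, including <keep /> and the start of the second comment
$lazy     # captures only <a />
```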

    The updated functions now handle both of your test cases, but you might find other edge cases that still fail (for example, if $uncomment contains regex metacharacters such as "." that are not escaped)...
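
    One way to guard against the metacharacter problem is to escape the tag name before building the pattern. This is only a sketch: UncommentXmlNodeEscaped is a hypothetical variant of the function above, and the my.Node / myXNode tags are invented for the demonstration.

```powershell
function UncommentXmlNodeEscaped
{
    param
    (
        [string] $xml,
        [string] $uncomment
    )
    # [regex]::Escape makes characters like "." match literally instead of "any character"
    $name = [regex]::Escape($uncomment)
    return $xml -replace "(?s)<!--(?<InnerXml><$name.*?)-->", '${InnerXml}'
}

# Without escaping, "my.Node" would also match "myXNode"
$text   = '<!--<myXNode />--><!--<my.Node />-->'
$result = UncommentXmlNodeEscaped -xml $text -uncomment 'my.Node'
$result
```

    With the escaped name only the second comment should be unwrapped, and the myXNode comment is left alone.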


    Epilogue

    Treating xml as plain text isn't always the best plan. The above function will fail with simple pathological cases, for example:

    • Whitespace before the element name - e.g.:
    <!--<   myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
         ^^^
    

    A more robust approach would be to parse the xml and then process all the comment nodes to check their contents:

    $comments = ([xml] "...").SelectNodes("//comment()")
    foreach( $comment in $comments )
    {
        ...
    }
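
    For what it's worth, here is one way that loop could be filled in. Treat it as a rough sketch under my own assumptions rather than a tested solution: UncommentXmlNodesDom and its $names parameter are hypothetical, a comment is only unwrapped if its body parses as xml and every element in it is one of the requested names, and PreserveWhitespace is set so the indentation survives.

```powershell
function UncommentXmlNodesDom
{
    param
    (
        [string]   $xml,
        [string[]] $names
    )
    $doc = New-Object System.Xml.XmlDocument
    $doc.PreserveWhitespace = $true   # keep indentation and line breaks
    $doc.LoadXml($xml)

    # copy the node list into an array first, because the loop modifies the tree
    foreach( $comment in @($doc.SelectNodes("//comment()")) )
    {
        $fragment = $doc.CreateDocumentFragment()
        try   { $fragment.InnerXml = $comment.Value }
        catch { continue }            # comment body is not well-formed xml, leave it alone

        # get_Name() avoids PowerShell's xml adapter returning an attribute called "Name"
        $elements = @($fragment.ChildNodes | Where-Object { $_ -is [System.Xml.XmlElement] })
        $unwanted = @($elements | Where-Object { $names -notcontains $_.get_Name() })
        if( $elements.Count -eq 0 -or $unwanted.Count -gt 0 ) { continue }

        # inserting a fragment inserts its children; then drop the comment itself
        $parent = $comment.ParentNode
        $null = $parent.InsertBefore($fragment, $comment)
        $null = $parent.RemoveChild($comment)
    }
    return $doc.OuterXml
}

$sampleXml = "<myXml>`n  <!-- comment I want to keep -->`n  <!--<myFirstOutcommentedXml attributeA=`"xy`" />`n  <mySecondOutcommentedXml attributeA=`"xy`" />-->`n</myXml>"
$domResult = UncommentXmlNodesDom -xml $sampleXml -names "myFirstOutcommentedXml", "mySecondOutcommentedXml"
$domResult
```

    Note that this still returns a string, so reading the file and writing it back with the original encoding remains up to the caller.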