Removing nodes from a large XML file using PowerShell


I need to remove a bunch of unnecessary nodes from an XML file, but the file I've been supplied with is around 1.6 GB, so it's not really feasible to use something like XmlDocument.Load, as that would be very resource-heavy.

Given this, I have been trying to solve my issue using both $reader = [System.Xml.XmlReader]::Create($path) and $writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml")

To remove the unnecessary items, I tried the following:

# Set the path to your XML file
$path = "C:\test\test.xml"

# Create an XmlReader object to read the file
$reader = [System.Xml.XmlReader]::Create($path)

# Create an XmlWriter object to write the modified XML
$writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml")

# Create a namespace manager and add the namespace prefix and URI
$nsManager = New-Object System.Xml.XmlNamespaceManager($reader.NameTable)
$nsManager.AddNamespace("g", "http://base.google.com/ns/1.0")

# Loop through the XML and remove unwanted nodes
while ($reader.Read()) {
    if ($reader.NodeType -eq "Element") {
      if ($reader.LocalName -eq "Item") {
         # Enter the Item element and loop through its child nodes
            $itemDepth = $reader.Depth
            while ($reader.Read() -and $reader.Depth -gt $itemDepth) {
                Write-Output $reader.LocalName
                # Remove unwanted child nodes of Item element
                if ($reader.NodeType -eq "Element" -and $reader.LocalName -eq "description") {
                    Write-Output Skip
                    $reader.Skip()
                } else {
                    # Write the node to the output file
                    $writer.WriteNode($reader, $false)
                }
            }
      } else{
        $writer.WriteNode($reader, $false)
      }
    }
}

# Clean up
$reader.Close()
$writer.Close()

This approach was maybe 50% of the way there, but the issue I have is that when the parent node is written, all of its children are written as well. The inner logic does work, but if I remove the outer else, the root of the document is never created, so I get an error about invalid XML.

As you'll see below it essentially gets to <channel> and copies everything in between.

For reference I have included a scaled down version of the XML file I've been using.

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0" xmlns:c="http://base.google.com/cns/1.0">
    <channel>
        <title>Title</title>
        <link>https://site.test</link>
        <description date="2023-03-07 12:15:08">Some description of my feed.</description>
        <item>
            <g:id>1234-5678-9876</g:id>
            <title>Title</title>
            <description>Description</description>
            <link></link>
            <g:price>146.00 GBP</g:price>
            <g:sale_price>48.70 GBP</g:sale_price>
            <g:google_product_category>Clothing</g:google_product_category>
            <g:product_type>Clothing</g:product_type>
            <g:brand>Jayley</g:brand>
            <g:condition>new</g:condition>
            <g:age_group>Adult</g:age_group>
            <g:color>Lilac</g:color>
            <g:gender>Female</g:gender>
            <g:pattern>Striped</g:pattern>
            <g:size>One Size</g:size>
            <g:item_group_id>5f5a22dbb7c91</g:item_group_id>
            <g:custom_label_0>Womens</g:custom_label_0>
            <g:shipping>
                <g:country>GB</g:country>
                <g:service>Standard Delivery</g:service>
                <g:price>1.99 GBP</g:price>
            </g:shipping>
            <c:count type="string">1</c:count>
        </item>
        <item>
            ...
        </item>
    </channel>
</rss>

Also, for reference, if I remove the outer else you can see it does loop through the children but then the XML is invalid.


Solution

  • The problem is that $writer.WriteNode($reader, $false) copies the reader's current element recursively, including all of its descendants, and advances the reader past that element's end tag.

    WriteNode() is therefore unsuitable for elements that should not be copied completely from the input to the output XML. Instead, use the more specific XmlWriter methods WriteStartElement, WriteStartAttribute, WriteString and WriteEndAttribute to build output elements piece by piece.
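
    The recursive behaviour of WriteNode can be seen in a minimal in-memory sketch. The XML string and the variable names are made up for illustration and are not part of the question's feed; the ::new() syntax assumes PowerShell 5 or later.

    ```powershell
    # Hypothetical two-item document, used only for this illustration
    $xml = '<channel><item><title>A</title><description>B</description></item><item/></channel>'

    $reader   = [Xml.XmlReader]::Create( [IO.StringReader]::new( $xml ) )
    $settings = [Xml.XmlWriterSettings]::new()
    $settings.OmitXmlDeclaration = $true
    $builder  = [Text.StringBuilder]::new()
    $writer   = [Xml.XmlWriter]::Create( $builder, $settings )

    # Position the reader on the first <item> start tag
    $null = $reader.ReadToFollowing( 'item' )

    # WriteNode copies <item> together with all of its descendants ...
    $writer.WriteNode( $reader, $false )
    $writer.Flush()
    $copied = $builder.ToString()

    # ... and leaves the reader on the node after </item>, here the second <item/>
    $nextNode    = $reader.Name
    $nextIsEmpty = $reader.IsEmptyElement

    $reader.Dispose()
    $writer.Dispose()
    ```

    Here $copied holds the complete first item, including its description child, which is why a child element cannot be filtered out once WriteNode has been called on its parent.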

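    This example removes description elements that are direct children of item. The channel-level description is kept because its parent is channel. Element names are compared with PowerShell's -eq operator, which is case-insensitive for strings; use -ceq for an exact, case-sensitive match.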

    $inputPath  = 'input.xml'
    $outputPath = 'output.xml'
    
    # Create absolute, native paths for .NET API (which doesn't respect PowerShell's current directory)
    $fullInputPath = Convert-Path -LiteralPath $inputPath
    $fullOutputPath = (New-Item $outputPath -ItemType File -Force).FullName
    
    $reader = $writer = $null
    
    # Hashtable that stores the path segments that lead to the current element
    $elementPath = @{}
    
    try {
        # Create an XmlReader for the input file
        $reader = [Xml.XmlReader]::Create( $fullInputPath )
    
        # Create an XmlWriter for the output file
        $writer = [Xml.XmlWriter]::Create( $fullOutputPath )
    
        # Read the first node (the XML declaration, if the file has one)
        $null = $reader.Read()
    
        while( -not $reader.EOF ) {
    
            if( $reader.NodeType -eq [Xml.XmlNodeType]::Element ) {
    
                # Keep track of where we are in the element tree
                $elementPath[ $reader.Depth ] = $reader.Name
    
                # If current element is 'description' and its parent is 'item', skip it
                if( $reader.Name -eq 'description' -and $elementPath[ $reader.Depth - 1 ] -eq 'item' ) {
                    # Skip current element
                    $reader.Skip()
    
                    # Skip any whitespace after element to avoid empty line in output
                    while( -not $reader.EOF -and $reader.NodeType -eq [Xml.XmlNodeType]::Whitespace ) {
                        $reader.Skip()
                    }   
    
                    continue
                }
    
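                # Remember whether this is a self-closing element such as <link/>. No EndElement
                # node follows it, and the property must be read while still on the element itself.
                $isEmpty = $reader.IsEmptyElement
    
                # Write the start tag of current element
                $writer.WriteStartElement( $reader.Prefix, $reader.LocalName, $reader.NamespaceURI )
                
                if( $reader.HasAttributes ) {
                    # Write the attributes of current element
                    while( $reader.MoveToNextAttribute() ) {
                        $writer.WriteStartAttribute( $reader.Prefix, $reader.LocalName, $reader.NamespaceURI )
                        $writer.WriteString( $reader.Value )
                        $writer.WriteEndAttribute()
                    }                
                }
    
                # Close a self-closing element right away, since no end tag will be read for it
                if( $isEmpty ) {
                    $writer.WriteEndElement()
                }
    
                # Read next node
                $null = $reader.Read()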
            }
            else {
                # If NodeType is EndElement, it writes the end tag.
                # Otherwise it copies any non-element node. 
                # Advances reader position as well!
                $writer.WriteNode( $reader, $false )
            }
        }    
    }
    finally {
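        # Cleanup: dispose only the objects that were actually created
        foreach( $disposable in $reader, $writer ) {
            if( $null -ne $disposable ) { $disposable.Dispose() }
        }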
    }
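
One way to check the result without loading it into memory is to stream through the output and count the elements that should be gone. The following is only a sketch: the function name Get-ChildElementCount and the in-memory sample are hypothetical, and the parent tracking reuses the depth-indexed hashtable idea from the example above.

```powershell
# Hypothetical helper: counts <$Name> elements whose direct parent is <$Parent>,
# reading the document as a forward-only stream
function Get-ChildElementCount {
    param(
        [Parameter(Mandatory)] [Xml.XmlReader] $Reader,
        [Parameter(Mandatory)] [string] $Parent,
        [Parameter(Mandatory)] [string] $Name
    )

    $elementPath = @{}   # depth -> name of the most recent element at that depth
    $count = 0

    while( $Reader.Read() ) {
        if( $Reader.NodeType -eq [Xml.XmlNodeType]::Element ) {
            $elementPath[ $Reader.Depth ] = $Reader.Name
            if( $Reader.Name -ceq $Name -and $elementPath[ $Reader.Depth - 1 ] -ceq $Parent ) {
                $count++
            }
        }
    }
    $count
}

# In-memory sample standing in for a real output file
$sample = '<rss><channel><description>feed</description><item><title>T</title></item><item><description>left over</description></item></channel></rss>'
$sampleReader = [Xml.XmlReader]::Create( [IO.StringReader]::new( $sample ) )
try {
    $remaining = Get-ChildElementCount -Reader $sampleReader -Parent 'item' -Name 'description'
}
finally {
    $sampleReader.Dispose()
}
```

For the real file, a reader created with [Xml.XmlReader]::Create( $fullOutputPath ) could be passed instead; a count of 0 would indicate that no description children of item remain.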