
Removing XML nodes to reduce the size of an XML log file to a given size


I'm having some difficulty removing nodes from an XML file. I've found many examples of others doing this in PowerShell through various means, and the code below seems identical to several of them, but I'm not getting the desired behavior.

My goal is to reduce the size of the output XML until it's below 4KB.

The code below doesn't error out, but the count of objects in $update_activity never changes, so the nodes don't actually seem to be removed.

This is a log in XML format, so I'm removing the oldest entries first.

sample xml:

<?xml version="1.0" encoding="utf-16"?>
<LogEntries version="1.0" appname="Dell Command | Update" appversion="4.3.0">
    <LogEntry>
        <serviceVersion>2.3.0.36</serviceVersion>
        <appname>DellCommandUpdate</appname>
        <level>Normal</level>
        <timestamp>2022-01-07T13:29:57.9364469-08:00</timestamp>
        <source>UpdateScheduler.UpdateScheduler.Start</source>
        <message>Starting the update scheduler.</message>
        <trace/>
        <data/>
    </LogEntry>
</LogEntries>

code:

    [xml]$dcuxml = get-content "C:\ProgramData\dell\UpdateService\Log\Activity.log"
    $xmllog = $dcuxml.LogEntries
    $update_activity = $xmllog.LogEntry | NotableDCU
    $i = 0
    Do{
        foreach($entry in $update_activity){
            $entry.parentnode.RemoveChild($entry)
            $xmlsize = [System.Text.Encoding]::UTF8.GetByteCount(($update_activity.InnerXml | Out-String)) / 1KB
        }
    }while($xmlsize -gt 3.99)

Solution

  • This is an alternative solution that uses a streaming approach based only on XmlReader and XmlWriter. Compared to my first solution, the size of the input file it can handle is not limited by the amount of available RAM.

    While my first solution reads the whole input file into an XmlDocument in memory, this one keeps only as many log entries in memory as are needed for the output file.

    It is also much faster than the first solution, because it doesn't incur the overhead of creating a DOM (a log file of 63 MB with 100k entries took about 1.5 seconds to process using the current solution, while it took more than 6 minutes(!) using my first solution).

    A disadvantage is that the code is lengthier than my first solution.

    $inputPath      = "$PWD\log.xml"
    $outputPath     = "$PWD\log_new.xml"
    
    # Maximum size of the output file (which can be slightly larger as we only 
    # count the size of the log entries).
    $maxByteCount   = 4KB
    
    $writerSettings = [Xml.XmlWriterSettings] @{
        Encoding = [Text.Encoding]::Unicode   # UTF-16 as in input document
        # Replace with this line to encode in UTF-8 instead 
        # Encoding = [Text.Encoding]::UTF8
        Indent = $true
        IndentChars = ' ' * 4   # should match indentation of input document
        ConformanceLevel = [Xml.ConformanceLevel]::Auto
    }
    
    $entrySeparator = "`n" + $writerSettings.IndentChars
    
    $totalByteCount = 0
    $queue = [Collections.Generic.Queue[object]]::new()
    
    $reader = $writer = $null
    
    try {
        # Open the input file.
        $reader = [Xml.XmlReader]::Create( $inputPath )
    
        # Create or overwrite the output file.
        $writer = [Xml.XmlWriter]::Create( $outputPath, $writerSettings ) 
        $writer.WriteStartDocument()  # write the XML declaration
    
        # Copy the document root element and its attributes without recursing into child elements.
        $null = $reader.MoveToContent()
        $writer.WriteStartElement( $reader.Name )
        $writer.WriteAttributes( $reader, $false )
    
        # Loop over the nodes of the input file.
        while( $reader.Read() ) {
            # Skip everything that is not an XML element
            if( $reader.NodeType -ne [xml.XmlNodeType]::Element ) {
                continue
            }
    
            # Read the XML of the current element and its children.
            $xmlStr = $reader.ReadOuterXml()
        # Calculate how many bytes the current element takes when written to file.
            $byteCount = $writerSettings.Encoding.GetByteCount( $xmlStr + $entrySeparator )
    
            # Append XML string and byte count to the end of the queue.
            $queue.Enqueue( [PSCustomObject]@{
                xmlStr = $xmlStr
                byteCount = $byteCount
            })
            $totalByteCount += $byteCount
    
            # Remove entries from beginning of queue to ensure maximum size is not exceeded.
            while( $totalByteCount -ge $maxByteCount ) {
                $totalByteCount -= $queue.Dequeue().byteCount
            }
        }
    
        # Write the last log entries, which are below maximum size, to the output file.
        foreach( $entry in $queue ) {
            $writer.WriteString( $entrySeparator )
            $writer.WriteRaw( $entry.xmlStr )
        }
    
        # Finish the document.
        $writer.WriteString("`n")   
        $writer.WriteEndElement()
        $writer.WriteEndDocument()    
    }
    finally {
        # Close the input and output files
        if( $writer ) { $writer.Dispose() }
        if( $reader ) { $reader.Dispose() }
    }
    

    The algorithm basically works like this:

    • Create a queue of custom objects that store the XML and the size in bytes per log entry.
    • For each log entry of the input file:
      • Read the XML of the log entry and calculate the size in bytes (as on disk, applying the output encoding) of the log entry. Add this data to the end of the queue.
      • If necessary, remove log entries from the beginning of the queue to ensure the desired maximum size in bytes is not exceeded.
    • Write the log entries from the queue to the output file.
    • For simplicity we only consider the size of the log entries, so the actual output file could be slightly larger, due to the XML declaration and the document root element.
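
    The size-bounded queue at the heart of the algorithm can be illustrated in isolation. The following sketch substitutes hypothetical hard-coded strings for the log entries read from the file, and uses a tiny 25-byte limit instead of 4KB; the trimming loop is the same as in the solution above:

    ```powershell
    # Maximum total size of entries kept in the queue (tiny, for demonstration).
    $maxByteCount   = 25
    $totalByteCount = 0
    $queue = [System.Collections.Generic.Queue[object]]::new()

    foreach( $xmlStr in '<a>1</a>', '<b>22</b>', '<c>333</c>', '<d>4444</d>' ) {
        # Size of this entry as it would be written to disk (all-ASCII here).
        $byteCount = [System.Text.Encoding]::UTF8.GetByteCount( $xmlStr )

        # Append the entry and its size to the end of the queue.
        $queue.Enqueue( [PSCustomObject]@{ xmlStr = $xmlStr; byteCount = $byteCount } )
        $totalByteCount += $byteCount

        # Drop the oldest entries until the running total is below the limit again.
        while( $totalByteCount -ge $maxByteCount ) {
            $totalByteCount -= $queue.Dequeue().byteCount
        }
    }

    # Only the most recent entries that fit within $maxByteCount remain.
    $queue.xmlStr -join ', '   # -> <c>333</c>, <d>4444</d>
    ```

    Because entries are dequeued from the front as new ones are enqueued at the back, the queue always holds the newest entries whose combined size fits the limit, which is exactly why the full solution writes out the most recent log entries.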