Tags: xml, powershell, onix

How to split an XML file into smaller files using PowerShell


I have large XML files ("ONIX" standard) I'd like to split. Basic structure is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<!-- DOCTYPE is not always present and may look different -->
<ONIXmessage> <!-- sometimes with an attribute -->
<header>
...
</header> <!-- up to this line every out-file should be identical to source -->
<product> ... </product>
<product> ... </product>
...
<product> ... </product>
</ONIXmessage>

What I want to do is split this file into n smaller files of approximately the same size. For this I'd count the number of <product> nodes, divide that count by n, and clone the products into n new XML files. I have searched a lot, and this task seems to be harder than I thought.
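
As a rough sketch of that arithmetic (assuming the document is already loaded into $xml and $n holds the desired number of output files):

$products = $xml.SelectNodes("//product")                    # every <product> node
$perFile  = [math]::Ceiling($products.Count / [double]$n)    # products per output file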

  1. What I could not solve so far is cloning a new XML document with an identical XML declaration, doctype, root element and <header> node, but without any <product>s. I could do this with regex, but I'd rather use XML tools.
  2. What would be the smartest way to transfer a number of <product> nodes to a new XML document? Property notation, like $xml.ONIXmessage.product | % { copy... }, XPath queries (can you select n nodes with XPath?) plus CloneNode(), or XmlReader/XmlWriter?
  3. The content of the nodes should be identical in formatting and encoding. How can this be ensured? (See the sketch after this list.)
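
A minimal sketch for point 3, assuming the .NET XmlDocument API (the file paths are placeholders): loading with PreserveWhitespace keeps the source formatting intact, and Save() honors the encoding declared in the XML declaration.

$doc = New-Object System.Xml.XmlDocument
$doc.PreserveWhitespace = $true      # retain whitespace exactly as in the source
$doc.Load("C:\onix\input.xml")       # placeholder path
# ... clone / remove / import nodes here ...
$doc.Save("C:\onix\output.xml")      # Save() writes the encoding from the XML declaration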

I'd be very grateful for some nudges in the right direction!


Solution

  • One way is to:

    1. Make copies of the XML file
    2. Remove all <product> nodes from the copies
    3. Use a loop to copy one product at a time from the original file to one of the copies.
    4. When you reach your product-per-file limit, save the current file (copy) and create a new file.

    Example:

    param($path, [int]$maxitems)
    
    $file = Get-ChildItem $path
    
    ################
    
    #Read file
    $xml = [xml](Get-Content -Path $file.FullName -Raw)
    $product = $xml.SelectSingleNode("//product")
    $parent = $product.ParentNode
    
    #Create copy-template
    $copyxml = [xml]$xml.OuterXml
    $copyproduct = $copyxml.SelectSingleNode("//product")
    $copyparent = $copyproduct.ParentNode
    #Remove all but one product (kept as an anchor so each copy's parent node can be located)
    $copyparent.SelectNodes("product") | Where-Object { $_ -ne $copyproduct } | ForEach-Object { $copyparent.RemoveChild($_) } > $null
    
    $allproducts = @($parent.SelectNodes("product"))
    $totalproducts = $allproducts.Count
    
    $fileid = 1
    $i = 0
    
    foreach ($p in $allproducts) {
        #If at the beginning, or the previous file is full, create a new file
        if($i % $maxitems -eq 0) {
            #Create copy of file
            $newFile = [xml]($copyxml.OuterXml)
            #Get parentnode
            $newparent = $newFile.SelectSingleNode("//product").ParentNode
            #Remove all products
            $newparent.SelectNodes("product") | ForEach-Object { $newparent.RemoveChild($_) } > $null
        }
    
        #Copy the product node into the new file
        $cur = $newFile.ImportNode($p,$true)
        $newparent.AppendChild($cur) > $null
    
        #Add 1 to "items moved"
        $i++ 
    
        #IF Full file, save
        if(($i % $maxitems -eq 0) -or ($i -eq $totalproducts)) {
            $newfilename = $file.FullName.Replace($file.Extension,"$fileid$($file.Extension)")
            $newFile.Save($newfilename)
            $fileid++
        }
    
    }
    

    UPDATE: Since performance was important here, I created a new version of the script that uses a foreach loop and an XML template for the copies, eliminating roughly 99% of the read and delete operations. The concept is still the same, but it is executed differently.
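
    As a usage note, assuming the script is saved as Split-Onix.ps1 (the file name is hypothetical):

    # Split big.xml into output files holding at most 1000 <product> nodes each
    .\Split-Onix.ps1 -path .\big.xml -maxitems 1000
    # With the naming scheme above, this produces big1.xml, big2.xml, ... next to the source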

    Benchmark:

    10 items, 3 per file:       old solution 0.0448831 seconds, new solution 0.0138742 seconds
    16001 items, 1000 per file: old solution 73.1934346 seconds, new solution 5.337443 seconds