
Split an XML File with PowerShell Based on the Number of Tags


So I have this XML file.

<?xml version="1.0" encoding="UTF-8"?>
<CPR:eCPR xmlns:CPR="http://www.google.com/">
  <CPR:contractorInfo>
    <CPR:contractorName>Company Name</CPR:contractorName>
  </CPR:contractorInfo>
  <CPR:projectInfo>
    <CPR:projectLocation>EARTH</CPR:projectLocation>
  </CPR:projectInfo>
  <CPR:payrollInfo>
    <CPR:statementOfNP>false</CPR:statementOfNP>
    <CPR:employees>
      <CPR:employee>
        <CPR:name id="1111:First One">First One</CPR:name>
      </CPR:employee>
      <CPR:employee>
        <CPR:name id="2222:Second Two">Second Two</CPR:name>
      </CPR:employee>
      <CPR:employee>
        <CPR:name id="3333:Third Three">Third Three</CPR:name>
      </CPR:employee>
      <CPR:employee>
        <CPR:name id="4444:Fourth Four">Fourth Four</CPR:name>
      </CPR:employee>
      <CPR:employee>
        <CPR:name id="5555:Fifth Five">Fifth Five</CPR:name>
      </CPR:employee>
    </CPR:employees>
  </CPR:payrollInfo>
</CPR:eCPR>

I need to split it so that each file contains "n" employees. For example, if each file should hold 2 employees, there will be 3 files, the last of which has only one employee, while every file keeps the rest of the tags.

What I want (file1)

<?xml version="1.0" encoding="UTF-8"?>
<CPR:eCPR xmlns:CPR="http://www.google.com/">
  <CPR:contractorInfo>
    <CPR:contractorName>Company Name</CPR:contractorName>
  </CPR:contractorInfo>
  <CPR:projectInfo>
    <CPR:projectLocation>EARTH</CPR:projectLocation>
  </CPR:projectInfo>
  <CPR:payrollInfo>
    <CPR:statementOfNP>false</CPR:statementOfNP>
    <CPR:employees>
      <CPR:employee>
        <CPR:name id="1111:First One">First One</CPR:name>
      </CPR:employee>
      <CPR:employee>
        <CPR:name id="2222:Second Two">Second Two</CPR:name>
      </CPR:employee>
    </CPR:employees>
  </CPR:payrollInfo>
</CPR:eCPR>

Here is what I did so far

$limit = 2
$logpath = "C:\dev\project\doximity\temp.xml"

[xml]$xml = Get-Content $logpath
$nsm = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$nsm.AddNamespace("CPR", "http://www.google.com/")

$index = 1
$ref = New-Object Xml.XmlDocument
$ref.XmlResolver = $null

$rows = $xml.SelectNodes("//CPR:employee", $nsm)
$c = $rows.Count
$rows | ForEach-Object {
    if ($index -eq 1) {
        $InsertNode = $ref.CreateElement("CPR", "employees", "http://www.google.com/")
        $InsertNode.InnerXml = ""
        $ref.AppendChild($InsertNode)
    }

    $ref.DocumentElement.AppendChild($ref.ImportNode($_, $true))
    $c--
    if ($index -eq $limit) {
        $index = 1
        $ref.Save("C:\dev\project\doximity\chunck{0:D3}.xml" -f ++$i)
        $ref = New-Object Xml.XmlDocument
        $ref.XmlResolver = $null
        if ($c -lt $limit) { $limit = $c }
    } else {
        $index++
    }
}

And the output is

<CPR:employees xmlns:CPR="http://www.google.com/">
  <CPR:employee>
    <CPR:name id="1111:First One">First One</CPR:name>
  </CPR:employee>
  <CPR:employee>
    <CPR:name id="2222:Second Two">Second Two</CPR:name>
  </CPR:employee>
</CPR:employees>

What am I missing?


Solution

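  • jdweng's helpful answer shows a solution based on LINQ-to-XML (System.Xml.Linq.XDocument).

  • What your attempt is missing: each output file is built from a new, empty XmlDocument that receives only a CPR:employees element and the imported CPR:employee nodes, so the surrounding elements (CPR:eCPR, CPR:contractorInfo, CPR:projectInfo, CPR:payrollInfo) and the XML declaration are never copied.

    Here's a streamlined formulation of your own [xml] (System.Xml.XmlDocument)-based attempt, which reuses the loaded document and swaps only the children of CPR:employees:

    • You can take advantage of PowerShell's convenient adaptation of the XML DOM, which allows namespace-agnostic drill-down into an XML document's elements and attributes using dot notation, i.e. as if they were properties.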
    # Create an XML DOM and load its content from a file.
    # NOTE: Be sure to use a *full path*, because .NET's working dir.
    #       usually differs from PowerShell's
    $xml = [xml]::new() # shorter and more efficient alternative to New-Object Xml.XmlDocument
    $xml.Load("C:\dev\project\doximity\temp.xml") 
    
    # Determine the parent element of interest, using *dot notation*.
    $empsRootElement = $xml.eCPR.payrollInfo.employees
    # Get all child nodes (elements) as an array.
    $empsElements = @($empsRootElement.ChildNodes)
    
    # Determine the chunk size and the number of chunks.
    $n = 2
    $chunks = [math]::Ceiling($empsElements.Count / $n)
    
    # Process each chunk.
    foreach ($i in 0..($chunks-1)) {
      # Remove all child nodes.
      $empsRootElement.RemoveAll()
      # Add the next chunk as the (only) children.
      $empsElements[($i*$n)..(($i+1)*$n-1)].
        ForEach({ $null = $empsRootElement.AppendChild($_) })
      # Save the chunk to a sequence-numbered file.
      $xml.Save(("C:\dev\project\doximity\chunk{0:D3}.xml" -f (1+$i)))
    }
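
As a variation on the loop above, here is a minimal sketch that leaves the loaded document untouched: it works on a deep copy and returns each chunk as an XML string. The function name Split-XmlChildren, its parameters, and the trimmed sample document are illustrative assumptions, not part of the original answer. The parent element is located with a namespace-agnostic local-name() XPath test, so no namespace manager is needed.

```powershell
# Hypothetical helper (not from the original answer): returns one XML string
# per chunk, each holding at most $ChunkSize child elements of the parent.
function Split-XmlChildren {
  param(
    [Parameter(Mandatory)] [xml] $Xml,
    [Parameter(Mandatory)] [string] $ParentXPath,
    [int] $ChunkSize = 2
  )
  if ($ChunkSize -lt 1) { throw 'ChunkSize must be at least 1.' }

  # Work on a deep copy so the caller's document is not modified.
  $copy = $Xml.CloneNode($true)
  $parent = $copy.SelectSingleNode($ParentXPath)
  if ($null -eq $parent) { throw "No element matches '$ParentXPath'." }

  # Snapshot the child elements, then detach them all from the parent.
  $children = @($parent.SelectNodes('*'))
  foreach ($child in $children) { $null = $parent.RemoveChild($child) }

  for ($start = 0; $start -lt $children.Count; $start += $ChunkSize) {
    $end = [math]::Min($start + $ChunkSize, $children.Count) - 1
    $slice = @($children[$start..$end])
    # Attach only this chunk's elements, emit the document, detach them again.
    foreach ($child in $slice) { $null = $parent.AppendChild($child) }
    $copy.OuterXml
    foreach ($child in $slice) { $null = $parent.RemoveChild($child) }
  }
}

# Trimmed stand-in for the question's file (assumption: five employees).
$names = 'First One', 'Second Two', 'Third Three', 'Fourth Four', 'Fifth Five'
$employees = ($names | ForEach-Object {
  "<CPR:employee><CPR:name>$_</CPR:name></CPR:employee>"
}) -join ''
$sample = [xml] ('<CPR:eCPR xmlns:CPR="http://www.google.com/"><CPR:payrollInfo><CPR:employees>' +
  $employees + '</CPR:employees></CPR:payrollInfo></CPR:eCPR>')

$chunks = @(Split-XmlChildren -Xml $sample -ParentXPath "//*[local-name()='employees']" -ChunkSize 2)
$chunks.Count   # number of chunk documents produced
```

Each returned string can be cast back with [xml] and written with its Save() method, again using a full path. Selecting child elements with '*' and removing them one by one, rather than calling RemoveAll(), keeps any attributes on the parent element intact.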