
Selecting distinct information from multiple similar/identical XML nodes


I have an XML file that looks something like the following. It is in a fixed format that I cannot change. The number of <OrderLine> elements in each file is not fixed.

<?xml version="1.0" encoding="utf-8"?>
<DTD_ORDER>
   <OrderHead>
      <OrderReferences>12345</OrderReferences>
      <OrderRecipient>
         <OrderLine>
            <LineNumber>1</LineNumber>
            <Product>ProductA</Product>
            <Quantity>1</Quantity>
            <Price>17.50</Price>
            <Discount>0.00</Discount>
            <LineGross>17.50</LineGross>
            <LineNet>14.58</LineNet>
            <LineTax>2.92</LineTax>
         </OrderLine>
         <OrderLine>
            <LineNumber>2</LineNumber>
            <Product>ProductB</Product>
            <Quantity>1</Quantity>
            <Price>27.50</Price>
            <Discount>10.00</Discount>
            <LineGross>27.50</LineGross>
            <LineNet>22.92</LineNet>
            <LineTax>4.58</LineTax>
         </OrderLine>
      </OrderRecipient>
   </OrderHead>
</DTD_ORDER>

These XML files are currently imported into the PowerShell script in question via [XML] (Get-Content $XMLFile), as other operations are done on the file (deduplication, removing values, etc.).

The data for each <OrderLine> block needs to be appended to a CSV file that is created when the script first runs on the day in question. Information from elsewhere in the XML is currently being appended to a CSV file using the -f Format operator combined with add-content.

The end result should be a CSV like the one below:

OrderReferences, LineNumber, Product, DateCreated, Price, LineGross, LineNet, LineTax, OrderLineVoucherValue, Discount
12345, 1, ProductA, 20/02/2023, 17.50, 17.50, 14.58, 2.92, 0.00, 0.00
12345, 2, ProductB, 20/02/2023, 27.50, 27.50, 22.92, 4.58, 0.00, 10.00

I need to analyse the file in PowerShell, query each <OrderLine> node, and read all the tags inside it, extracting them into a CSV file.

I appreciate I could use the SelectNodes("//OrderLine") method, but as I understand it, that would return all of the tags (i.e. LineNumber, Product) in their own arrays, possibly out of order, when really I need to loop through the <OrderLine> elements and work on their child tags. PowerShell probably has a really simple way of doing this, but I'm struggling to find the answer...

All help appreciated!


Solution

  • Here is a way using Select-Xml:

    $dateCreated = '20/02/2023'
    
    Select-Xml -Path input.xml -XPath '//OrderLine' | ForEach-Object Node | 
        Select-Object @(
            @{ n='OrderReferences'; e={ $_.ParentNode.ParentNode.OrderReferences }}
            'LineNumber'
            'Product'
            @{ n='DateCreated'; e={ $dateCreated }}
            'Quantity', 'Price', 'Discount', 'LineGross', 'LineNet', 'LineTax'
        ) | Export-Csv output.csv -NoTypeInformation
    

    The Select-Object call takes an array that specifies which properties (columns) should appear in the output. The first one is a calculated property that gets the OrderReferences value from the parent node, two levels up. There is another calculated property for the DateCreated column that references a variable.

    The other properties are copied directly from the OrderLine element, so they can simply be specified by name. You can change the property order to your liking. You can either put each name on a separate line or separate them with commas.

    In case you already have the XML document in a variable, just remove the -Path argument and pipe the variable to Select-Xml:

    $xml | Select-Xml -XPath '//OrderLine' …
    

    To append to an existing CSV, just add the -Append argument to the Export-Csv call.
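
    The last two points can be combined into one sketch: an in-memory document piped to Select-Xml, with the result appended to a daily CSV. The file name pattern and $csvPath are assumptions made for illustration (the question does not say how the CSV is named), and the inline XML is a trimmed stand-in for the document the script already holds.

    ```powershell
    # Hypothetical daily file name; adjust to whatever your script already uses.
    $csvPath = Join-Path ([IO.Path]::GetTempPath()) ('orders_{0:yyyyMMdd}.csv' -f (Get-Date))
    $dateCreated = '20/02/2023'

    # Stand-in for the document loaded earlier via [xml](Get-Content $XMLFile).
    [xml]$xml = '<DTD_ORDER><OrderHead><OrderReferences>12345</OrderReferences><OrderRecipient>
        <OrderLine><LineNumber>1</LineNumber><Product>ProductA</Product><Price>17.50</Price></OrderLine>
        <OrderLine><LineNumber>2</LineNumber><Product>ProductB</Product><Price>27.50</Price></OrderLine>
    </OrderRecipient></OrderHead></DTD_ORDER>'

    # -Append adds rows to an existing file, or creates the file (with header) if it is missing.
    $xml | Select-Xml -XPath '//OrderLine' | ForEach-Object Node |
        Select-Object @(
            @{ n='OrderReferences'; e={ $_.ParentNode.ParentNode.OrderReferences }}
            'LineNumber', 'Product'
            @{ n='DateCreated'; e={ $dateCreated }}
            'Price'
        ) | Export-Csv -Path $csvPath -NoTypeInformation -Append
    ```

    Note that when appending, Export-Csv expects the selected properties to match the columns already present in the file, so keep the property list the same on every run.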

    Output:

    "OrderReferences","LineNumber","Product","DateCreated","Quantity","Price","Discount","LineGross","LineNet","LineTax"
    "12345","1","ProductA","20/02/2023","1","17.50","0.00","17.50","14.58","2.92"
    "12345","2","ProductB","20/02/2023","1","27.50","10.00","27.50","22.92","4.58"
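
    As for the concern in the question about SelectNodes: it returns the matching OrderLine elements themselves, in document order, not separate arrays per child tag. So an explicit loop that keeps the -f and Add-Content approach mentioned in the question is also possible. This is only a minimal sketch; the inline XML is a trimmed copy of the sample, and $csvPath is a hypothetical path.

    ```powershell
    # Hypothetical output path, used only for this illustration.
    $csvPath = Join-Path ([IO.Path]::GetTempPath()) 'orderlines_sketch.csv'

    # Trimmed copy of the sample document from the question.
    [xml]$xml = '<DTD_ORDER><OrderHead><OrderReferences>12345</OrderReferences><OrderRecipient>
        <OrderLine><LineNumber>1</LineNumber><Product>ProductA</Product><Price>17.50</Price></OrderLine>
        <OrderLine><LineNumber>2</LineNumber><Product>ProductB</Product><Price>27.50</Price></OrderLine>
    </OrderRecipient></OrderHead></DTD_ORDER>'

    # Each $line is one whole <OrderLine> element; its child tags are read as properties.
    $rows = foreach ($line in $xml.SelectNodes('//OrderLine')) {
        $orderRef = $line.SelectSingleNode('ancestor::OrderHead/OrderReferences').InnerText
        '{0},{1},{2},{3}' -f $orderRef, $line.LineNumber, $line.Product, $line.Price
    }

    # $rows now holds '12345,1,ProductA,17.50' and '12345,2,ProductB,27.50'.
    $rows | Add-Content -Path $csvPath
    ```

    Unlike Export-Csv, this writes the values as-is, so any value containing a comma or quote would need to be escaped by hand.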