PowerShell XmlDocument: save as UTF-8 without BOM


I built an XML object of type System.Xml.XmlDocument.

$scheme.GetType()

IsPublic IsSerial Name        BaseType
-------- -------- ----        --------
True     False    XmlDocument System.Xml.XmlNode

I use the method save() to save it to a file.

$scheme.save()

This saves the file as UTF-8 with a BOM. The BOM causes issues with other scripts down the line.

When we open the XML file in Notepad++ and save it as UTF-8 (without the BOM), other scripts down the line don't have a problem. So I've been asked to save the file without the BOM.

The MS documentation for the save method states:

The value of the encoding attribute is taken from the XmlDeclaration.Encoding property. If the XmlDocument does not have an XmlDeclaration, or if the XmlDeclaration does not have an encoding attribute, the saved document will not have one either.

The MS documentation on XmlDeclaration lists encoding values such as UTF-8, UTF-16, and others. It does not mention a BOM.

Does the XmlDeclaration have an encoding property that leaves out the BOM?

PS. This behavior is identical in PowerShell 5 and PowerShell 7.


Solution

  • Unfortunately, when the declaration of an XML document contains an explicit encoding="utf-8" attribute, .NET's [xml] (System.Xml.XmlDocument) type, when .Save() is given a file path, writes a UTF-8-encoded file with a BOM, which can indeed cause problems (even though it shouldn't[1]).

    A request to change this has been green-lighted in principle, but is not yet implemented as of .NET 9.0 (due to a larger discussion about changing [System.Text.Encoding]::UTF8 to not use a BOM, in which case .Save() would automatically not create a BOM anymore either).

    Somewhat ironically, the absence of an encoding attribute causes .Save() to create UTF-8-encoded files without a BOM.

    A simple solution is therefore to remove the encoding attribute[2]; e.g.:

    # Create a sample XML document:
    $xmlDoc = [xml] '<?xml version="1.0" encoding="utf-8"?><foo>bar</foo>'
    
    # Remove the 'encoding' attribute from the declaration.
    # Without this, the .Save() method below would create a UTF-8 file *with* BOM.
    $xmlDoc.ChildNodes[0].Encoding = $null
    
    # Now, saving produces a UTF-8 file *without* a BOM.
    $xmlDoc.Save("$PWD/out.xml")
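
To check which variant a given .Save() call actually produced, one option is to look at the first three bytes of the file, since a UTF-8 BOM is the byte sequence EF BB BF. The following is only a minimal sketch: Test-Utf8Bom is a hypothetical helper name (not a built-in cmdlet), and the temp-file names are assumptions made for illustration.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true if the file
# starts with the UTF-8 BOM bytes EF BB BF.
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $Path)
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and
            $bytes[1] -eq 0xBB -and
            $bytes[2] -eq 0xBF)
}

# Illustrative output locations (assumed names, in the temp directory).
$dir         = [System.IO.Path]::GetTempPath()
$withAttr    = Join-Path $dir 'bom-check-with-encoding.xml'
$withoutAttr = Join-Path $dir 'bom-check-without-encoding.xml'

# Save once with the 'encoding' attribute still present ...
$xmlDoc = [xml] '<?xml version="1.0" encoding="utf-8"?><foo>bar</foo>'
$xmlDoc.Save($withAttr)

# ... and once after removing it.
$xmlDoc.ChildNodes[0].Encoding = $null
$xmlDoc.Save($withoutAttr)

$hasBomWith    = Test-Utf8Bom $withAttr
$hasBomWithout = Test-Utf8Bom $withoutAttr

Write-Output "With encoding attribute, BOM present:    $hasBomWith"
Write-Output "Without encoding attribute, BOM present: $hasBomWithout"
```

If the behavior described above holds on your .NET version, the first line should report True and the second False.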
    

    [1] Per the XML W3C Recommendation: "entities encoded in UTF-8 MAY begin with the Byte Order Mark" [BOM].

    [2] This is safe to do, because the XML W3C Recommendation effectively mandates UTF-8 as the default in the absence of both a BOM and an encoding attribute.
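
If the encoding attribute has to stay in the declaration, a possible alternative (a sketch, not part of the approach above) is to avoid the file-path overload and pass .Save() a System.IO.StreamWriter built with a BOM-less System.Text.UTF8Encoding. The output file name below is an assumption for illustration.

```powershell
# Sample document that keeps its 'encoding' attribute.
$xmlDoc = [xml] '<?xml version="1.0" encoding="utf-8"?><foo>bar</foo>'

# Illustrative output path (assumed name, in the temp directory).
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'out-no-bom.xml'

# $false = do not emit a UTF-8 BOM (encoderShouldEmitUTF8Identifier).
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)

# StreamWriter(path, append, encoding); $false = overwrite rather than append.
$writer = [System.IO.StreamWriter]::new($outFile, $false, $utf8NoBom)
try {
    # Uses the XmlDocument.Save(TextWriter) overload, so the writer's
    # encoding decides how the bytes are written.
    $xmlDoc.Save($writer)
}
finally {
    $writer.Dispose()
}

# Inspect the result: first bytes of the file and its text content.
$firstBytes = [System.IO.File]::ReadAllBytes($outFile)[0..2]
$text       = [System.IO.File]::ReadAllText($outFile)
Write-Output ('First bytes: ' + (($firstBytes | ForEach-Object { $_.ToString('X2') }) -join ' '))
```

With this overload the writer's encoding, rather than the declaration, governs the output, so the file should start directly with the bytes of "<?xml" instead of EF BB BF. Note that [System.Text.UTF8Encoding]::new() requires PowerShell 5 or later; on older versions New-Object can be used instead.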