c# xml xmlreader byte-order-mark xbrl

Parsing and removing BOM/Preamble from XML via filesystem


I am processing XBRL files and ran into a number of them that have a Byte-Order-Mark (BOM) at the start. If I manually remove it, I can process the file without any issue.

I've made several attempts to strip the BOM in code from the start of the XML files I am reading, and all of them have failed.

This is the error message I am receiving:

Data at the root level is invalid. Line 1, position 1.

Originally I was using XDocument.Load(filename), which failed with this error. I then modified the code following the advice in the question "Parsing xml string to an xml document fails if the string begins with <?xml... ?> section", but it still fails with the same error.

void Main()
{
    XDocument doc;
    var @filename = @"C:\accounts\toprocess\2008\Prod224_8998_00741575_20080630.xml";
    byte[] file = File.ReadAllBytes(filename);
    using (MemoryStream memory = new MemoryStream(file))
    {
        using (XmlTextReader oReader = new XmlTextReader(memory))
        {
            doc = XDocument.Load(oReader);
        }
    }
}

The XML file can be found here: http://s000.tinyupload.com/download.php?file_id=92333278767554773703&t=9233327876755477370347742



Solution

  • C3 AF C2 BB C2 BF looks like a double UTF-8 encoded BOM. The UTF-8 encoding of the BOM (U+FEFF) is EF BB BF. If each of those three bytes is treated as a separate character (U+00EF, U+00BB, U+00BF) and UTF-8 encoded again, you end up with exactly the sequence that you're seeing.

    So the document you have is broken. Something took a document containing a UTF-8 BOM, read it as a single-byte "extended ASCII" encoding (for example Latin-1 or Windows-1252), and wrote it back out as UTF-8. If you can't get the documents fixed at the source, I'd be inclined to look for that specific six-byte sequence at the start of the file and strip it if present.

    If the documents in question contain other non-ASCII characters, there's a good chance those will be mangled in the same way.
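
The stripping approach suggested above could be sketched roughly as follows. This is only a minimal illustration, not a definitive fix: the class and method names (`MangledBomStripper`, `StripDoubleEncodedBom`, `LoadXml`, `SimulateDoubleEncoding`) are hypothetical, and the simulation assumes the file was misread as ISO-8859-1, which is a guess about how the damage happened. It removes only the six-byte prefix and does not repair any other double-encoded characters in the document.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class MangledBomStripper
{
    // The UTF-8 BOM (EF BB BF) after being misread as a single-byte
    // encoding and then encoded as UTF-8 a second time.
    private static readonly byte[] DoubleEncodedBom =
        { 0xC3, 0xAF, 0xC2, 0xBB, 0xC2, 0xBF };

    // Returns the data without the double-encoded BOM prefix,
    // or the original array unchanged if the prefix is absent.
    public static byte[] StripDoubleEncodedBom(byte[] data)
    {
        if (data.Length >= DoubleEncodedBom.Length &&
            data.Take(DoubleEncodedBom.Length).SequenceEqual(DoubleEncodedBom))
        {
            return data.Skip(DoubleEncodedBom.Length).ToArray();
        }
        return data;
    }

    // Reads the file, strips the broken prefix if present, and parses the rest.
    public static XDocument LoadXml(string path)
    {
        byte[] bytes = StripDoubleEncodedBom(File.ReadAllBytes(path));
        using (var stream = new MemoryStream(bytes))
        {
            return XDocument.Load(stream);
        }
    }

    // Illustration only: reproduces the damage by decoding UTF-8 bytes as
    // ISO-8859-1 (an assumed culprit) and re-encoding the result as UTF-8.
    public static byte[] SimulateDoubleEncoding(byte[] utf8Bytes)
    {
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
        return Encoding.UTF8.GetBytes(misread);
    }
}
```

With this in place, the `Main` method from the question would call `MangledBomStripper.LoadXml(filename)` instead of building the `XmlTextReader` by hand. Passing the bytes EF BB BF to `SimulateDoubleEncoding` yields C3 AF C2 BB C2 BF, which matches the sequence described above.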