XmlSerializer.Serialize BOM missing


I am using this code to store my class:

using (FileStream stream = new FileStream(myPath, FileMode.Create))
{
    XmlSerializer serializer = new XmlSerializer(typeof(MyClass));
    serializer.Serialize(stream, obj); // obj is the MyClass instance to store
}

This writes a file that I can read back fine with XmlSerializer.Deserialize. The generated file, however, is not a proper text file. XmlSerializer.Serialize doesn't write a BOM, but still emits multibyte characters. The file is thus implicitly treated as ANSI (we expect an XML file to be a text file, and Windows considers a text file without a BOM to be ANSI), so some editors show ö as Ã¶.

Is this a known bug? Or some setting that I'm missing?

Here is what the generated file starts with:

<?xml version="1.0"?>
<SvnProjects xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

The first byte in the file is hex 3C, i.e. the <.


Solution

  • Having or not having a BOM does not define a "proper text file". In fact, I'd say the most typical format these days is UTF-8 without a BOM; I don't think I've ever seen anyone actually use the UTF-8 BOM in real systems! But if you want a BOM, that's fine: just pass in an Encoding that emits one. For UTF-8 with a BOM:

    using (var writer = XmlWriter.Create(myPath, s_settings))
    {
        XmlSerializer serializer = new XmlSerializer(typeof(MyClass));
        serializer.Serialize(writer, obj);
    }
    

    with:

    static readonly XmlWriterSettings s_settings =
        new XmlWriterSettings { Encoding = new UTF8Encoding(true) };
    

    The result of this is a file that starts EF-BB-BF, the UTF-8 BOM.

    If you want a different encoding, replace new UTF8Encoding(true) with the encoding you want, remembering to enable the BOM.

    (note: the static Encoding.UTF8 instance has the BOM enabled, but IMO it is better to be very explicit here if you specifically intend to use a BOM, just like you should be very explicit about what Encoding you intended to use)
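
    To check which encodings carry a BOM, you can inspect the preamble each one reports. The following is a minimal sketch, not part of the original answer; the class name BomCheck and its methods are made up for illustration.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper, for illustration only.
    public static class BomCheck
    {
        // Returns true if the bytes start with the UTF-8 BOM (EF BB BF).
        public static bool StartsWithUtf8Bom(byte[] bytes)
        {
            return bytes.Length >= 3
                && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        }

        public static void Demo()
        {
            // The static Encoding.UTF8 instance reports the 3-byte BOM as its preamble.
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF

            // An encoding constructed with 'false' reports an empty preamble.
            Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
        }
    }
    ```

    The same StartsWithUtf8Bom check can be applied to File.ReadAllBytes(myPath) to confirm what actually ended up on disk.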


    Edit: the key difference here is that Serialize(Stream, object) ends up using:

    XmlTextWriter xmlWriter = new XmlTextWriter(stream, encoding: null) {
        Formatting = Formatting.Indented,
        Indentation = 2
    };
    

    which then ends up using:

    public StreamWriter(Stream stream) : this(stream,
        encoding: UTF8NoBOM, // <==== THIS IS THE PROBLEM
        bufferSize: 1024, leaveOpen: false)
    {
    }
    

    so: UTF-8 without BOM is the default if you use that API.
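
    To see the difference between the two overloads without touching the file system, both can be run against a MemoryStream and the leading bytes compared. This is a rough sketch under the assumptions above; Sample and SerializeComparison are hypothetical names introduced only for this example.

    ```csharp
    using System;
    using System.IO;
    using System.Text;
    using System.Xml;
    using System.Xml.Serialization;

    // Hypothetical type to serialize; it contains a non-ASCII character on purpose.
    public class Sample
    {
        public string Name { get; set; } = "ö";
    }

    // Hypothetical helper, for illustration only.
    public static class SerializeComparison
    {
        // Uses Serialize(Stream, object): expected to produce UTF-8 without a BOM.
        public static byte[] ViaStream(object obj)
        {
            var serializer = new XmlSerializer(obj.GetType());
            using (var ms = new MemoryStream())
            {
                serializer.Serialize(ms, obj);
                return ms.ToArray();
            }
        }

        // Uses Serialize(XmlWriter, object) with an encoding that emits a BOM.
        public static byte[] ViaXmlWriter(object obj)
        {
            var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(true) };
            var serializer = new XmlSerializer(obj.GetType());
            using (var ms = new MemoryStream())
            {
                // Dispose the writer first so everything is flushed into ms.
                using (var writer = XmlWriter.Create(ms, settings))
                {
                    serializer.Serialize(writer, obj);
                }
                return ms.ToArray();
            }
        }
    }
    ```

    With this sketch, ViaStream(new Sample()) should start with 3C (the <), matching the file in the question, while ViaXmlWriter(new Sample()) should start with EF-BB-BF.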