c# unit-testing visual-studio-2010 utf-8 equality

How do I ignore the UTF-8 Byte Order Marker in String comparisons?


I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. The same test case works properly in Visual Studio 2008 (with .NET 3.5).

Here's the relevant code snippet:

byte[] rawData = GetData();
string data = Encoding.UTF8.GetString(rawData);

Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);

While debugging this test, the data string looks identical to the literal. When I called data.ToCharArray(), I noticed that the first character of data has the value 65279 (U+FEFF), which is the Byte Order Mark; in UTF-8 it is encoded as the bytes EF BB BF. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.

How do I get Encoding.UTF8.GetString() to not put the Byte Order Marker in the resulting string?

Update: The problem was that GetData(), which reads a file from disk, was reading the file's raw bytes directly through a FileStream, so the BOM stayed in the byte array. I corrected this by reading the file with a StreamReader and converting the resulting string to bytes with Encoding.UTF8.GetBytes(), which is what it should have been doing in the first place. Thanks for all the help.


Solution

  • Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding if you don't want it, but you should consider whether the byte array should contain the BOM to begin with.

    EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:

    using System;
    using System.IO;
    using System.Text;
    
    class Test
    {
        static void Main()
        {
            // UTF-8 BOM (EF BB BF) followed by 'A'
            byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
            // GetString decodes the BOM as a character (U+FEFF)
            string viaEncoding = Encoding.UTF8.GetString(withBom);
            Console.WriteLine(viaEncoding.Length); // 2
    
            string viaStreamReader;
            // StreamReader detects the BOM and skips it
            using (StreamReader reader = new StreamReader
                   (new MemoryStream(withBom), Encoding.UTF8))
            {
                viaStreamReader = reader.ReadToEnd();
            }
            Console.WriteLine(viaStreamReader.Length); // 1
        }
    }
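
The other option mentioned in the answer, removing the BOM yourself, could look roughly like the sketch below. BomHelper, DecodeUtf8WithoutBom and StripBom are hypothetical names made up for this illustration and are not part of the original post or of the framework; only Encoding.UTF8.GetPreamble() and the GetString(byte[], int, int) overload are real framework calls. The first helper skips the preamble bytes before decoding, and the second strips a leading U+FEFF from a string that has already been decoded.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not from the original post.
static class BomHelper
{
    // Decodes UTF-8 bytes, skipping a leading BOM (EF BB BF) if one is present.
    public static string DecodeUtf8WithoutBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = 0;
        if (bytes.Length >= preamble.Length)
        {
            bool hasBom = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i]) { hasBom = false; break; }
            }
            if (hasBom) offset = preamble.Length;
        }
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Alternative: remove a leading U+FEFF from an already decoded string.
    public static string StripBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }
}

class Demo
{
    static void Main()
    {
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
        Console.WriteLine(BomHelper.DecodeUtf8WithoutBom(withBom).Length);
        Console.WriteLine(BomHelper.StripBom(Encoding.UTF8.GetString(withBom)).Length);
    }
}
```

With either helper, the Assert.AreEqual call from the question would compare "Constant" against a string that no longer starts with U+FEFF. Fixing the source of the bytes, as the update to the question does, is still the cleaner approach when that is possible.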