HttpClient: Correct order to detect encoding

I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be HTML, CSS, JavaScript, or XML.

Currently I check the charset from headers, then check for a BOM (byte order mark) before I finally check the first part of the file for a charset meta tag. Normally this works fine, because there are no conflicts.

But: Is that order correct (in case of conflict)?

The code I currently use:

Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);

Global.EncodingFraContentType is a helper that uses a regular expression to find the charset declared in either the XML declaration or a meta tag.

What is the correct order for detecting the charset/encoding?

Solution

Conclusion - in order of importance:

1. Byte order mark (BOM): If present, this is authoritative, since it was written by the editor that actually saved the file. (A BOM can only be present with Unicode encodings.)
2. Content-Type charset (in the header set by the server): For dynamically created/processed files it should be present (since the server knows the encoding), but it might be missing for static files (the server just sends those as they are).
3. Inline charset: For XML, HTML and CSS the encoding can be specified inside the document, in the XML prolog, an HTML meta tag, or @charset in CSS. To read it, you need to decode the first part of the document with a single-byte encoding such as 'Windows-1252'.
4. Assume UTF-8. This is the standard of the web and is by far the most used today.

If the found encoding is 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5; read more at Wikipedia).
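
As an illustration of the first step, here is a minimal sketch of BOM sniffing done directly on the byte array instead of through StreamReader. The class and method names (BomSniffer, DetectBom) are made up for this example and are not part of the method further down.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading byte order mark, or null if there is none.
    // UTF-32 LE is tested before UTF-16 LE because its BOM (FF FE 00 00) starts with FF FE.
    public static Encoding DetectBom(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));

        if (StartsWith(bytes, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(true, true);
        if (StartsWith(bytes, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(false, true, true);
        if (StartsWith(bytes, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true, true);
        if (StartsWith(bytes, 0xFF, 0xFE)) return new UnicodeEncoding(false, true, true);
        if (StartsWith(bytes, 0xFE, 0xFF)) return new UnicodeEncoding(true, true, true);
        return null; // no BOM: continue with the header, inline charset, then UTF-8
    }

    private static bool StartsWith(byte[] bytes, params byte[] prefix)
    {
        if (bytes.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (bytes[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If you decode with Encoding.GetString afterwards, skip the first GetPreamble().Length bytes yourself, since GetString does not strip the BOM.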

Now try to decode the document using the found encoding. If the decoder is set up to throw on invalid bytes (exception fallback), that might fail! In that case:

Use 'Windows-1252'. This was the standard in old Windows files and works fine as a last resort (there are still a lot of old files out there). It will never throw errors. However, it might of course be wrong.
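
The "strict first, lenient last" idea can be sketched on a plain byte array like this. SafeDecoder and Decode are hypothetical names for this example; the caller passes in the last-resort encoding, which would be 'Windows-1252' where it is available.

```csharp
using System;
using System.Text;

public static class SafeDecoder
{
    // Decodes strictly with the detected encoding; if the bytes are invalid for it,
    // falls back to a last-resort encoding (ideally a single-byte one that accepts any byte).
    public static string Decode(byte[] bytes, Encoding detected, Encoding lastResort)
    {
        // Clone so the caller's instance is not modified, then make the decoder throw
        // instead of silently substituting U+FFFD for invalid input.
        Encoding strict = (Encoding)detected.Clone();
        strict.DecoderFallback = DecoderFallback.ExceptionFallback;
        try
        {
            return strict.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return lastResort.GetString(bytes);
        }
    }
}
```

For example, the bytes 63 61 66 E9 are not valid UTF-8, so decoding them with Encoding.UTF8 as the detected encoding ends up in the fallback branch.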

I have made a method that implements this. The regex I use is able to find encodings specified as:

Xml: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>

html: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

css: @charset "utf-8";

(It works with both single and double quotes.)
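
To try the pattern on its own, it can be wrapped in a small helper like the sketch below. It uses the same regex as the method further down; CharsetSniffer and FindInlineCharset are names invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    private static readonly Regex CharsetRegex = new Regex(
        @"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Returns the declared charset name, or null when no declaration is found.
    // Note: the XML and CSS forms are only recognized at the very start of the text,
    // because ^ is used without RegexOptions.Multiline.
    public static string FindInlineCharset(string documentStart)
    {
        Match match = CharsetRegex.Match(documentStart);
        return match.Success ? match.Value : null;
    }
}
```

Calling CharsetSniffer.FindInlineCharset with any of the sample declarations above yields the charset name, which can then be passed to Encoding.GetEncoding.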

You will need:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):

public static async Task<string> GetString(HttpClient httpClient, Uri url)
{
    byte[] bytes;
    Encoding encoding = null;
    Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
    {
        responseMessage.EnsureSuccessStatusCode();
        bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
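        // CharSet may be missing, empty, or quoted (charset="utf-8"); normalize it so that
        // Encoding.GetEncoding does not reject an otherwise valid name.
        string headerCharset = responseMessage.Content.Headers.ContentType?.CharSet?.Trim(' ', '"', '\'');
        if (string.IsNullOrEmpty(headerCharset)) headerCharset = null;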

        byte[] buffer = new byte[0x1000];
        Array.Copy(bytes, buffer, Math.Min(bytes.Length, buffer.Length));
        using (MemoryStream ms = new MemoryStream(buffer))
        {
            using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, buffer.Length, true))
            {
                string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                {
                    encoding = sr.CurrentEncoding;
                }
                else if (headerCharset != null)
                {
                    encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
                else
                {
                    string inlineCharset = charsetRegex.Match(testString).Value;
                    if (!string.IsNullOrEmpty(inlineCharset))
                    {
                        encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        encoding = new UTF8Encoding(false, true);
                    }
                }
                // Compare code pages: Encoding.Equals also compares the fallbacks, which differ here.
                if (encoding.CodePage == 28591) // ISO-8859-1
                {
                    encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
            }
        }
        using (MemoryStream ms = new MemoryStream(bytes))
        {
            try
            {
                using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
            catch (DecoderFallbackException)
            {
                ms.Position = 0;
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
        }
    }
}

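You should wrap the method call in a try/catch, since HttpClient throws (for example an HttpRequestException) if the request fails, and Encoding.GetEncoding throws an ArgumentException if the declared charset name is unknown.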

Update:

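In .NET Core, the 'Windows-1252' encoding is not available by default (big mistake IMHO). You can either register CodePagesEncodingProvider.Instance with Encoding.RegisterProvider once at startup, or settle for 'ISO-8859-1'.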
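
A minimal sketch of the registration, assuming CodePagesEncodingProvider is available to the project (on older targets it comes from the System.Text.Encoding.CodePages NuGet package; newer .NET versions ship it with the framework). EncodingSetup and GetWindows1252 are made-up names for this example.

```csharp
using System.Text;

public static class EncodingSetup
{
    // Registers the code page provider and returns Windows-1252.
    // Registering the same provider more than once is harmless, but doing it once
    // at application startup is enough.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

After the registration, Encoding.GetEncoding("Windows-1252") in the method above should resolve as it does on .NET Framework.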