Tags: c#, http, encoding, dotnet-httpclient

HttpClient: Correct order to detect encoding


I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be HTML, CSS, JavaScript or XML.

Currently I check the charset from headers, then check for a BOM (byte order mark) before I finally check the first part of the file for a charset meta tag. Normally this works fine, because there are no conflicts.

But is that order correct in case of a conflict?

The code I currently use:

Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);
Global.EncodingFraContentType is a helper that uses a regular expression to find the charset defined in either an XML declaration or a meta tag.

What is the correct order in which to detect the charset/encoding?


Solution

  • Conclusion - in order of importance:

    1. Byte Order Mark (BOM): If present, this is authoritative, since it was added by the editor that actually saved the file (a BOM can only be present with Unicode encodings).
    2. Content-Type charset (in header set by the server): For dynamically created/processed files, it should be present (since the server knows), but might not be for static files (the server just sends those).
    3. Inline charset: For XML, HTML and CSS, the encoding can be specified inside the document, in the XML prolog, an HTML meta tag, or @charset in CSS. To read it, you need to decode the first part of the document using, for instance, the 'Windows-1252' encoding.
    4. Assume utf-8. This is the standard of the web and is today by far the most used.
    5. If the found encoding equals 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5; read more on Wikipedia).
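
The order above can be sketched as a small, pure function. This is only an illustration: the class name, the method name and the three parameters are hypothetical, and each parameter is assumed to hold the charset name found by that source, or null/empty if that source found nothing. Stripping quotes is an extra precaution on my part, since a header value may arrive quoted (charset="utf-8").

```csharp
using System;

public static class CharsetPrecedence
{
    // Picks a charset name following the order of the list above.
    // Hypothetical inputs: the name found via BOM, HTTP header and inline declaration.
    public static string Resolve(string bomCharset, string headerCharset, string inlineCharset)
    {
        string chosen = "utf-8"; // step 4: default when no source names a charset

        // Steps 1-3: the first source that names a charset wins.
        foreach (string candidate in new[] { bomCharset, headerCharset, inlineCharset })
        {
            // Remove surrounding whitespace and quotes, e.g. "\"utf-8\"" becomes "utf-8".
            string cleaned = candidate?.Trim().Trim('"', '\'');
            if (!string.IsNullOrEmpty(cleaned))
            {
                chosen = cleaned;
                break;
            }
        }

        // Step 5: treat ISO-8859-1 as Windows-1252.
        if (string.Equals(chosen, "iso-8859-1", StringComparison.OrdinalIgnoreCase))
        {
            chosen = "windows-1252";
        }
        return chosen;
    }
}
```

The returned name would then be passed to Encoding.GetEncoding, which can still reject names it does not know.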

    Now try to decode the document using the found encoding. If error handling is turned on, that might fail! In that case:

    1. Use 'Windows-1252'. This was the standard in old Windows files and works fine as a last resort (there are still a lot of old files out there). It will never throw errors. However, it might of course be wrong.
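
A minimal sketch of that "strict first, lenient last" idea, separate from the full method in this answer. The class and method names are hypothetical. To keep the sketch runnable without registering extra code pages, it uses ISO-8859-1 as the last resort instead of Windows-1252; like Windows-1252 in .NET, it maps every byte to a character and so does not throw.

```csharp
using System;
using System.Text;

public static class StrictDecoder
{
    // Decodes with the named charset in "throw on invalid bytes" mode,
    // and falls back to a single-byte encoding if that fails.
    public static string DecodeWithFallback(byte[] bytes, string charsetName)
    {
        // Stand-in for Windows-1252 (assumption made for this sketch only).
        Encoding lastResort = Encoding.GetEncoding("iso-8859-1");
        try
        {
            Encoding strict = Encoding.GetEncoding(
                charsetName,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            return strict.GetString(bytes);
        }
        catch (ArgumentException)
        {
            // Covers an unknown charset name as well as invalid bytes:
            // DecoderFallbackException derives from ArgumentException.
            return lastResort.GetString(bytes);
        }
    }
}
```

For example, the single byte 0xE9 is not valid UTF-8 on its own, so DecodeWithFallback(new byte[] { 0xE9 }, "utf-8") ends up in the fallback branch.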

    I have made a method that implements this. The regex I use is able to find encodings specified as:

    XML: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>

    HTML: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    CSS: @charset "utf-8";

    (It works with both single and double quotes.)
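
To try the pattern on its own, here is a small sketch that wraps it in a helper. The pattern and options are copied from the method further down in this answer; the class name and the FindInlineCharset helper are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineCharsetDemo
{
    // Same pattern and options as in the GetString method of this answer.
    private static readonly Regex CharsetRegex = new Regex(
        @"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Returns the charset name declared in the text, or null if none is found.
    public static string FindInlineCharset(string documentStart)
    {
        Match match = CharsetRegex.Match(documentStart);
        return match.Success ? match.Value : null;
    }
}
```

Note that the XML and CSS alternatives are anchored with ^ and the regex is not in multiline mode, so those two forms are only found when they appear at the very start of the text. The meta tag form can appear anywhere, as long as the tag and its charset are on one line.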

    You will need:

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    

    Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):

    public static async Task<string> GetString(HttpClient httpClient, Uri url)
    {
        byte[] bytes;
        Encoding encoding = null;
        Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);
    
        using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
        {
            responseMessage.EnsureSuccessStatusCode();
            bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
            string headerCharset = responseMessage.Content.Headers.ContentType?.CharSet;
    
            // Sniff only the first 4 KB. Wrapping the original array avoids decoding zero padding.
            int sniffLength = Math.Min(bytes.Length, 0x1000);
            using (MemoryStream ms = new MemoryStream(bytes, 0, sniffLength))
            {
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, 0x1000, true))
                {
                    string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                    if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                    {
                        encoding = sr.CurrentEncoding;
                    }
                    else if (headerCharset != null)
                    {
                        encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        string inlineCharset = charsetRegex.Match(testString).Value;
                        if (!string.IsNullOrEmpty(inlineCharset))
                        {
                            encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                        }
                        else
                        {
                            encoding = new UTF8Encoding(false, true);
                        }
                    }
                    if (encoding.Equals(Encoding.GetEncoding("iso-8859-1")))
                    {
                        encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                }
            }
            using (MemoryStream ms = new MemoryStream(bytes))
            {
                try
                {
                    using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                    {
                        return await sr.ReadToEndAsync().ConfigureAwait(false);
                    }
                }
                catch (DecoderFallbackException)
                {
                    ms.Position = 0;
                    using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                    {
                        return await sr.ReadToEndAsync().ConfigureAwait(false);
                    }
                }
            }
        }
    }
    

    You should wrap the method call in a try/catch, because HttpClient can throw if the request fails.
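
One way such a wrapper could look, as a rough sketch. The class, the method and the fetch parameter are hypothetical; fetch stands in for a call such as () => GetString(httpClient, url). Returning null on failure is just one possible policy.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SafeFetch
{
    // Runs the given fetch delegate and returns null instead of throwing
    // for the typical HttpClient failure cases.
    public static async Task<string> TryFetchAsync(Func<Task<string>> fetch)
    {
        try
        {
            return await fetch().ConfigureAwait(false);
        }
        catch (HttpRequestException)
        {
            // Network failure, or a non-success status from EnsureSuccessStatusCode.
            return null;
        }
        catch (TaskCanceledException)
        {
            // HttpClient surfaces timeouts as a canceled task.
            return null;
        }
    }
}
```

Other exceptions are deliberately left to propagate. For instance, Encoding.GetEncoding throws an ArgumentException when the header or the document names a charset it does not recognize, and you may want to handle that case separately.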

    Update:

    In .NET Core, the 'Windows-1252' encoding is not available by default (big mistake IMHO), so there you must settle for 'ISO-8859-1'.
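
As far as I know, the code page encodings can be enabled on .NET Core by registering CodePagesEncodingProvider. A minimal sketch, assuming the provider is available (it ships in the System.Text.Encoding.CodePages package and, to my knowledge, in-box with recent .NET versions); the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class LegacyEncodings
{
    // Returns Windows-1252 if it can be resolved, otherwise ISO-8859-1.
    public static Encoding GetWindows1252OrLatin1()
    {
        // Assumption: CodePagesEncodingProvider is available in this project.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        try
        {
            return Encoding.GetEncoding(1252);
        }
        catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
        {
            // Code page 1252 is still unavailable: settle for ISO-8859-1.
            return Encoding.GetEncoding("iso-8859-1");
        }
    }
}
```

Registering the provider once at startup should also make Encoding.GetEncoding("Windows-1252") in the method above work on .NET Core.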