c# asp.net character-encoding html-encode

How can I write generic code to read HTML pages that use different encodings?


I'm trying to write code that reads the content of a web page, but I don't know in advance which encoding the page uses. How can I write generic code that returns the correct string, without strange symbols? The encoding might be "UTF-8", "windows-1256", and so on. I've tried UTF-8, but when the page is encoded with windows-1256 I get strange symbols.

Here is the code I'm using:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("SOME-URL");
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8);
string content = streamReader.ReadToEnd();

And here is a link that causes the problem: http://forum.khleeg.com/144828.html


Solution

  • Examine the response text and look for this tag:

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
    

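    These characters are decoded correctly even before you know the real encoding, because they are plain ASCII, which UTF-8 and the Windows code pages encode identically. Using the charset named in this tag, create your Encoding object with the GetEncoding method, like this: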

    var enc1 = Encoding.GetEncoding("windows-1256");
    var enc2 = Encoding.GetEncoding(1256);
    

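    Another way is to use the .CharacterSet property of the HttpWebResponse, which is taken from the charset parameter of the Content-Type header:

    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    string charset = response.CharacterSet;
    var enc1 = Encoding.GetEncoding(charset);
    

    Do not use the .ContentEncoding property for this. It reflects the Content-Encoding header, which names a compression method such as gzip rather than a charset, so passing it to GetEncoding will normally throw an ArgumentException.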
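
Putting the header and the meta tag together, a minimal sketch might look like the following. It assumes the raw response bytes have already been read into a byte array, and it parses the charset out of the Content-Type header value (for example response.ContentType) itself, because on .NET Framework CharacterSet appears to report ISO-8859-1 for text responses that declare no charset. The class and method names (HtmlCharsetSniffer, FindCharset, FindMetaCharset, TryGetEncoding, Decode) are hypothetical, and the 4096-byte scan window is an arbitrary choice. On .NET Core and later, code pages such as windows-1256 may also need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) before GetEncoding can find them.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: picks an encoding from the Content-Type header,
// then from a <meta> tag in the body, then falls back to UTF-8.
public static class HtmlCharsetSniffer
{
    // Matches "charset=..." in a header value such as "text/html; charset=utf-8".
    private static readonly Regex HeaderCharset = new Regex(
        @"charset\s*=\s*[""']?([\w.:\-]+)", RegexOptions.IgnoreCase);

    // Matches "charset=..." inside a <meta ...> tag, covering both
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> and <meta charset="...">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta\b[^>]*?charset\s*=\s*[""']?([\w.:\-]+)", RegexOptions.IgnoreCase);

    // Returns the charset named in a Content-Type header value, or null if there is none.
    public static string FindCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        Match m = HeaderCharset.Match(contentType);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the charset declared in a <meta> tag near the start of the body, or null.
    public static string FindMetaCharset(byte[] body)
    {
        // The tag itself is plain ASCII, so a one-byte-per-character decoding
        // such as ISO-8859-1 is enough to read it, whatever the real encoding is.
        int length = Math.Min(body.Length, 4096);
        string head = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, length);
        Match m = MetaCharset.Match(head);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Maps a charset name to an Encoding, or returns null if the name is empty or unknown.
    public static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim()); }
        catch (ArgumentException) { return null; }
    }

    // Decodes the body with the first usable charset: header, then <meta>, then UTF-8.
    public static string Decode(byte[] body, string contentType)
    {
        Encoding encoding = TryGetEncoding(FindCharset(contentType))
                            ?? TryGetEncoding(FindMetaCharset(body))
                            ?? Encoding.UTF8;
        return encoding.GetString(body);
    }
}
```

With the code from the question, this would mean copying response.GetResponseStream() into a MemoryStream instead of wrapping it in a StreamReader, then calling HtmlCharsetSniffer.Decode(bytes, response.ContentType). The header is tried first because a charset sent by the server normally takes precedence over one declared inside the document.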