c# encoding rss xmlreader

RSS reader (WebClient/XmlTextReader) somehow uses the wrong encoding - including a failing unit test


I have an RSS reader that works fine on some RSS feeds, but on one feed I have problems with special Danish characters. I suspect an encoding issue.

I see these encodings in the raw HTTP response from this URL: http://www.sydvestjyllandsefterskole.dk/rss

Content-Type: text/xml; Charset=UTF-8 
<?xml version="1.0" encoding="iso-8859-1" ?>

I have tried those two encodings and others, but nothing seems to work.
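To illustrate why a mismatch between the header and the XML declaration matters, here is a minimal sketch. The class and method names are hypothetical and not part of my reader. It encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1, which is what a parser would do if it trusted the XML declaration while the bytes were really UTF-8. Whether that is what actually happens with this feed is an assumption on my part.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingMismatchDemo
{
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1.
    // Every non-ASCII character becomes two or more wrong characters.
    public static string DecodeUtf8BytesAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }
}
```

Calling `EncodingMismatchDemo.DecodeUtf8BytesAsLatin1("fremmøde")` turns the single `ø` into the two characters `Ã¸`, while the ASCII letters around it are unchanged.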

I have written a unit test (NUnit) to show the problem and what I have tried:

public IEnumerable<TestCaseData> RssItemEncodingTestCases
{
    get
    {
        yield return new TestCaseData("http://www.sydvestjyllandsefterskole.dk/rss", "Stort fremmøde til dejlig familiedag.", new ASCIIEncoding());
        yield return new TestCaseData("http://www.sydvestjyllandsefterskole.dk/rss", "Stort fremmøde til dejlig familiedag.", new UTF8Encoding());
        yield return new TestCaseData("http://www.sydvestjyllandsefterskole.dk/rss", "Stort fremmøde til dejlig familiedag.", new UnicodeEncoding());
        yield return new TestCaseData("http://www.sydvestjyllandsefterskole.dk/rss", "Stort fremmøde til dejlig familiedag.", Encoding.GetEncoding("ISO-8859-1"));
    }
}

[TestCaseSource("RssItemEncodingTestCases")]
public void TestEncoding(string url, string expectedToStartWith, Encoding encoding)
{
    var description = Read(url, encoding);

    Assert.That(description, Is.StringStarting(expectedToStartWith));
}

public string Read(string url, Encoding encoding = null)
{
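    var client = new WebClient();
    // Note: WebClient.Encoding applies to the string methods (DownloadString, UploadString);
    // it does not change the raw bytes returned by OpenRead below.
    if (encoding != null)
        client.Encoding = encoding;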
    try
    {
        using (XmlReader reader = new XmlTextReader(client.OpenRead(url)))
        {
            while (reader.Read())
            {
                if (reader.IsStartElement() && reader.Name == "item")
                {
                    while (reader.Read())
                    {
                        switch (reader.Name)
                        {
                            case "description":
                                return reader.ReadElementContentAsString();
                        }
                        if (reader.Name == "item" && reader.NodeType == XmlNodeType.EndElement)
                            break;
                    }
                }
            }
        }
    }
    catch
    {
        // Note: this swallows every exception, so network and XML errors show up only as a null result.
    }
    return null;
}

Expected: String starting with "Stort fremmøde til dejlig familiedag."
But was: "Stort fremmøde til dejlig familiedag.

Any idea how to get this decoded properly?


Solution

  • It was fixed by having the feed's publisher change the encoding declared in the RSS feed to utf-8, so that it matches the Content-Type header:

    <?xml version="1.0" encoding="utf-8" ?>
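If the feed cannot be changed, a client-side workaround may be possible. The sketch below is an assumption, not what was actually done here: it decodes the stream with an explicitly chosen encoding through a `StreamReader` and hands the resulting text to `XmlReader.Create`. When an `XmlReader` reads from a `TextReader`, the characters are already decoded, so the `encoding` attribute in the XML declaration is not used for decoding. The class name `FeedReader` and the method `ReadFirstItemDescription` are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class FeedReader
{
    // Returns the text of the first <description> inside the first <item>,
    // or null if the feed has no such element.
    // The caller chooses the encoding; the XML declaration does not override it.
    public static string ReadFirstItemDescription(Stream feed, Encoding actualEncoding)
    {
        using (var text = new StreamReader(feed, actualEncoding))
        using (XmlReader reader = XmlReader.Create(text))
        {
            // Move to the first <item> element, if any.
            if (!reader.ReadToFollowing("item"))
                return null;

            // Look for <description> among the descendants of that <item>.
            if (!reader.ReadToDescendant("description"))
                return null;

            return reader.ReadElementContentAsString();
        }
    }
}
```

With the code from the question this could be called as `FeedReader.ReadFirstItemDescription(client.OpenRead(url), Encoding.UTF8)`, assuming the charset in the Content-Type header (UTF-8) reflects the bytes the server really sends.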