Search code examples
c#parsingwebxmldocument

Parsing web pages via C#, XmlDocument.LoadXml


I'm trying to download a web page and parse it. I need to reach every node of html document. So I used WebClient to download, which works perfectly. Then I use following code segment to parse the document:

 WebClient client = new WebClient();

 Stream data = client.OpenRead("http://web.cs.hacettepe.edu.tr/~bil339/");
 StreamReader reader = new StreamReader(data);
 string xml = reader.ReadToEnd();

 data.Close();
 reader.Close();
 XmlDocument xmlDoc = new XmlDocument();
 xmlDoc.loadXml(xml);

In last line, program waits for some time, then crashes. It says there are errors in HTML code, this wasn't expected, that shouldn't be here, etc. Any suggestions to fix this? Other techniques to parse HTML code are welcome (In C#, of course.)


Solution

  • Use the HTMLAgilityPack to parse HTML. Well-formed HTML is not XML and can't be parsed as such. For instance, it lacks the <?xml version="1.0" encoding="UTF-8"?> preamble that all XML files require. The HTML Agility Pack is more forgiving.