
Strange behaviour of XmlDocument.LoadXml and GetElementById; how to declare a string with quotation marks


Here's some C# code:

string webPageStr =  @"<html><body><div id=""content"">good content</div><div id=""badcontent"">bad content</div></body></html>";
XmlDocument webPage = new XmlDocument();
webPage.LoadXml(webPageStr);

XmlElement divElement = webPage.GetElementById("content");

divElement is null, and I don't know why.

I have also tried declaring webPageStr like this:

string webPageStr = @"<html><body><div id=&quot;content&quot;>good content</div><div id=&quot;badcontent&quot;>bad content</div></body></html>";

but then XmlDocument throws an exception: System.Xml.XmlException: "&" bad token

What's wrong with this code?


Solution

  • You need to include a DOCTYPE declaration if you want to use the GetElementById method, because the method has no way of knowing which attribute serves as the ID for the given XML. An attribute counts as an ID only when a DTD declares it with type ID; being named "id" is not enough. Your document is XHTML, so referencing the XHTML DTD tells the parser that the "id" attribute is an ID:

    string webPageStr = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html><body><div id=""content"">good content</div><div id=""badcontent"">bad content</div></body></html>";
    XmlDocument webPage = new XmlDocument();
    webPage.LoadXml(webPageStr);
    XmlElement divElement = webPage.GetElementById("content");
    

    This first approach means that your code needs web access at run time to retrieve the DTD referenced in the DOCTYPE declaration (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).
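
    If fetching the external DTD is a concern, one possible variation is to declare the ID attribute in an internal DTD subset, so nothing has to be downloaded. The following is only a sketch: the class name InlineDtdExample and the helper FindById are made up for illustration, and the ATTLIST covers only the div element used in the question.

```csharp
using System;
using System.Xml;

public static class InlineDtdExample
{
    // Hypothetical helper (not part of the original answer): looks up an
    // element by ID in the sample document from the question.
    public static XmlElement FindById(string id)
    {
        // The internal DTD subset declares that the "id" attribute of <div>
        // has type ID. No external DTD is referenced.
        string webPageStr = @"<!DOCTYPE html [
  <!ATTLIST div id ID #IMPLIED>
]>
<html><body><div id=""content"">good content</div><div id=""badcontent"">bad content</div></body></html>";

        XmlDocument webPage = new XmlDocument();
        webPage.LoadXml(webPageStr);

        // Returns null when no element carries the requested ID.
        return webPage.GetElementById(id);
    }

    public static void Main()
    {
        XmlElement divElement = FindById("content");
        Console.WriteLine(divElement == null ? "not found" : divElement.InnerText);
    }
}
```

    Only elements listed in an ATTLIST declaration take part in the lookup, so a real page would need one declaration per element type that carries an id.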

    An alternative approach is to use an XPath expression:

    string webPageStr = @"<html><body><div id=""content"">good content</div><div id=""badcontent"">bad content</div></body></html>";
    XmlDocument webPage = new XmlDocument();
    webPage.LoadXml(webPageStr);
    XmlNode divElement = webPage.SelectSingleNode("//div[@id=\"content\"]");
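
    As for the quotation marks: the &quot; version fails because XML requires attribute values to be delimited by literal quote characters, and an entity reference such as &quot; is only recognized inside a value or in element content. The doubled quotes in the verbatim string were already correct. The sketch below shows three equivalent ways to write the literal; the class name QuotingExample and the method Variants are hypothetical names used only for this illustration.

```csharp
using System;
using System.Xml;

public static class QuotingExample
{
    // Three ways to put quoted attribute values into a C# string literal.
    public static string[] Variants()
    {
        return new[]
        {
            // Verbatim string: a double quote is written as "".
            @"<div id=""content"">good content</div>",
            // Regular string: a double quote is escaped as \".
            "<div id=\"content\">good content</div>",
            // XML also accepts single quotes around attribute values.
            "<div id='content'>good content</div>",
        };
    }

    public static void Main()
    {
        foreach (string xml in Variants())
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xml);

            // Single quotes inside the XPath avoid escaping in the C# literal.
            XmlNode div = doc.SelectSingleNode("//div[@id='content']");
            Console.WriteLine(div == null ? "not found" : div.InnerText);
        }
    }
}
```

    All three literals describe the same element, so which one to use is mostly a matter of readability.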