Using XPath in node.js


I am building a small document parser in node.js. For testing, I have a raw HTML file; when the real application runs, that HTML is downloaded from the live website.

I want to extract the first code example from each section of the Console.WriteLine documentation that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which are in this Gist.

In my node.js application, I am using xmldom and xpath to try to extract that same information:

var dom = require('xmldom').DOMParser;
var xpath = require('xpath');

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup, doc);

This does not return anything, however.

What might be going on here?


Solution

  • This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).

    Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...

    var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
    var doc = new dom().parseFromString(rawHtmlString, 'text/html');
    // useNamespaces returns a select function that knows the prefix mapping
    var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    // call the returned function, not xpath.select, or the "x" prefix stays unbound
    var sampleNodes = select(exampleLookup, doc);
    

    Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.

    Example...

    //*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
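
To illustrate why the unprefixed path can come back empty, here is a minimal sketch of how an XPath 1.0 name test compares names. It is a simplified model, not the xpath package's actual code: expandNameTest, matchesNameTest and the plain-object element are hypothetical names made up for this example. The point it shows is that an unprefixed step such as div means "div in no namespace", so it cannot match an element the parser has placed in the XHTML namespace.

```javascript
// Simplified model of an XPath 1.0 name test (assumption: illustration only,
// not how the xpath package is implemented internally).
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Resolve a name test such as "div" or "x:div" to an expanded name.
// An unprefixed name always means "no namespace" in XPath 1.0; the
// document's default namespace is never applied to it.
function expandNameTest(nameTest, prefixes) {
  const parts = nameTest.split(':');
  if (parts.length === 1) {
    return { namespaceURI: null, localName: parts[0] };
  }
  const [prefix, localName] = parts;
  if (!Object.prototype.hasOwnProperty.call(prefixes, prefix)) {
    throw new Error('Unbound prefix: ' + prefix);
  }
  return { namespaceURI: prefixes[prefix], localName: localName };
}

// An element matches only if both the namespace URI and the local name agree.
function matchesNameTest(element, nameTest, prefixes) {
  const expected = expandNameTest(nameTest, prefixes || {});
  return element.namespaceURI === expected.namespaceURI &&
    element.localName === expected.localName;
}

// Hypothetical element, shaped the way a parser might report an XHTML <div>.
const div = { namespaceURI: XHTML_NS, localName: 'div' };

console.log(matchesNameTest(div, 'div', {}));                // false
console.log(matchesNameTest(div, 'x:div', { x: XHTML_NS })); // true
```

The prefix itself is arbitrary; only the URI it is bound to has to equal the element's namespace URI. The local-name() variant sidesteps the comparison by ignoring the namespace URI entirely.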