
Cannot evaluate XPath on XMLDoc without placing it in the DOM


So far I have only been able to get result nodes with XPath when the XML is part of my page's DOM, which feels wrong.

Setup:

I am attempting to show a fragment of an XML Document (TEI/XML) on my HTML page. I have the URL of an XML Document and an XPath selector of the fragment. I thought I could fetch() the document and extract the piece I wanted like so:

// Real values for one case:
// t.source = "https://centerfordigitalhumanities.github.io/Dunbar-books/The-Complete-Poems-TEI.xml"
// t.selector.value = "//div[@type='poem'][8]"

const sampleSource = await fetch(t.source)
  .then(res => res.text())
  .then(docStr => (new DOMParser()).parseFromString(docStr, "application/xml"))

const poemText = sampleSource.evaluate(t.selector?.value, sampleSource, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)

textSample.innerHTML = poemText.snapshotItem(0).innerHTML

No Result

I tried several variations (changing the contextNode, using XPathEvaluator.evaluate() instead of XMLDoc.evaluate(), and changing the XPathResult type), but the result was always empty.

In frustration, I tried simpler and simpler selectors and discovered that evaluate() was only traversing my current HTML document despite making no references to it.

The Workaround

It "works" to dump the XML doc into a hidden element on the page.

const sampleSource = await fetch(t.source)
  .then(res => res.text())
  .then(docStr => hiddenElem.innerHTML = docStr)

const poemText = document.evaluate(t.selector?.value, hiddenElem, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)

textSample.innerHTML = poemText.snapshotItem(0).innerHTML

Questions

  1. Is this how it is supposed to work, that evaluate() only traverses document?
  2. Is there a better practice than my workaround?

Solution

  • This is a TEI document, so its elements are in the namespace http://www.tei-c.org/ns/1.0. With XPath 1.0 against an XML DOM document, an unprefixed name test such as div never matches namespaced elements; it selects only div elements that are in no namespace. To select elements in a namespace with XPath 1.0, use the third argument of evaluate() to bind a prefix of your choice (such as tei) to that namespace, and write the expression with that prefix, e.g. //tei:div[@type='poem'][8]:

    <script type=module>
    const sampleSource = await fetch('https://centerfordigitalhumanities.github.io/Dunbar-books/The-Complete-Poems-TEI.xml')
      .then(res => res.text())
      .then(docStr => (new DOMParser()).parseFromString(docStr, "application/xml"));
    
    const poemText = sampleSource.evaluate(`//tei:div[@type='poem'][8]`, sampleSource, prefix => prefix === 'tei' ? 'http://www.tei-c.org/ns/1.0' : null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    
    console.log(poemText.snapshotItem(0).textContent);
    </script>
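The inline arrow function above works as the resolver because evaluate() only needs something that maps a prefix to a namespace URI. When several prefixes are involved, the same idea can be sketched as a small factory. The name makeNamespaceResolver is hypothetical (it is not a DOM API), and the browser call is shown only as a comment because it needs a parsed XML document:

```javascript
// Hypothetical helper (not part of the DOM API): builds a resolver function
// from a plain prefix-to-namespace-URI map.
function makeNamespaceResolver(prefixMap) {
  // evaluate() calls the resolver with each prefix used in the expression;
  // returning null signals that the prefix is not bound.
  return (prefix) =>
    Object.prototype.hasOwnProperty.call(prefixMap, prefix)
      ? prefixMap[prefix]
      : null;
}

// The prefix "tei" is an arbitrary choice; only the namespace URI has to
// match the one declared in the document.
const teiResolver = makeNamespaceResolver({
  tei: 'http://www.tei-c.org/ns/1.0',
});

// In a browser, pass it as the third argument of evaluate(), e.g.:
//   xmlDoc.evaluate("//tei:div[@type='poem'][8]", xmlDoc, teiResolver,
//                   XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
```

The own-property check keeps inherited names such as toString from being treated as bound prefixes.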

    With XPath 2 or 3, which Saxon-JS 2 supports, for instance, you can bind a default element namespace and use an unqualified name like div to select elements in that namespace.

    <script src=https://www.saxonica.com/saxon-js/documentation/SaxonJS/SaxonJS2.rt.js></script>
    
    <script type=module>
        const sampleSource = await SaxonJS.getResource({ location : 'https://centerfordigitalhumanities.github.io/Dunbar-books/The-Complete-Poems-TEI.xml', type : 'xml' });
    
    
        const poemText = SaxonJS.XPath.evaluate(`//div[@type='poem'][8]`, sampleSource, { xpathDefaultNamespace : 'http://www.tei-c.org/ns/1.0' });
    
        console.log(poemText.textContent);
    </script>

    XPath 1.0 offers no way to set such a default element namespace, unless the DOM environment lets you build a namespace-less DOM (as Java does with a non-namespace-aware DocumentBuilder). Inside a browser that is not possible, as far as I know.
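Short of binding a prefix or a default namespace, a common XPath 1.0 idiom is to match on local-name() instead of a name test. This is not part of the answer above, and it is looser: it matches a div in any namespace, or in none. A minimal sketch, where anyNamespace is a hypothetical helper that only builds the expression string:

```javascript
// Hypothetical helper: turns a simple element name into an XPath 1.0 step
// that ignores the element's namespace.
function anyNamespace(localName) {
  // Accept only plain names so the value can be embedded in quotes safely.
  if (!/^[A-Za-z_][\w.-]*$/.test(localName)) {
    throw new Error(`Not a simple element name: ${localName}`);
  }
  return `*[local-name()='${localName}']`;
}

// Namespace-agnostic version of the question's selector.
const poemSelector = `//${anyNamespace('div')}[@type='poem'][8]`;
// poemSelector is "//*[local-name()='div'][@type='poem'][8]"

// In a browser it needs no resolver:
//   xmlDoc.evaluate(poemSelector, xmlDoc, null,
//                   XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
```

Binding a prefix remains the more precise option when the document mixes vocabularies that share local names.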