Tags: xml, xquery, marklogic, marklogic-10, marklogic-dhf

MarkLogic: Find documents where at least one parent element does not have a particular child


Using the built-in MarkLogic cts functions, I'd like to write a query that finds documents like the following: documents in which at least one <parent> element does not have a <child1> child (here, parents/parent[2]). However, I do not want to exclude documents just because other <parent> elements in them do have a <child1> child (parents/parent[1] or parents/parent[3]).

<doc>
   <root>
      <parents>
         <parent>
            <child1>someValue</child1>
            <child2>someValue</child2>
         </parent>
         <parent>
            <child2>someValue</child2>
         </parent>
         <parent>
            <child1>someValue</child1>
            <child2>someValue</child2>
         </parent>
      </parents>
   </root>
</doc>

My thought process was that simply negating the following would return what I'm searching for:

Positive XQuery:

let $query :=
cts:element-query(
   xs:QName('parent')
   ,cts:element-query(
      xs:QName('child1')
      ,cts:true-query()
      )
   )
return cts:search(fn:doc(),$query)

or using the search module:

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

let $options := 
<options xmlns="http://marklogic.com/appservices/search">
        
  <extract-document-data selected="include">
    <extract-path xmlns:es="http://marklogic.com/entity-services">//root</extract-path>
  </extract-document-data>

  <additional-query>
    <cts:element-query>
      <cts:element>parent</cts:element>
      <cts:element-query>
        <cts:element>child1</cts:element>
        <cts:true-query/>
      </cts:element-query>
    </cts:element-query>
  </additional-query>
  
</options>
return search:search("",$options)

Leading to my attempted query:

Negative XQuery:

let $query :=
cts:not-query(
   cts:element-query(
      xs:QName('parent')
      ,cts:element-query(
         xs:QName('child1')
         ,cts:true-query()
         )
      )
   )
return cts:search(fn:doc(),$query)

On further reflection, though, it's clear why the "Negative" query does not behave as I'd expect. The positive query returns documents where the path //parent/child1 exists, so its negation returns only documents where //parent/child1 does not exist anywhere. My example document contains //parent/child1, so it is excluded.

Nonetheless, I am unsure how to do this efficiently with MarkLogic's cts functions. This database harvests billions of documents, so plain XQuery/XPath would be too slow. I'm really hoping to achieve this using the search module/API. Please keep in mind (despite my search module example above) that I'm hoping to run this query through a call to the search REST endpoint, so I will not be able to enhance the server-side search with XQuery. That said, if it can only be achieved with pure XQuery, I can fall back to the eval REST endpoint.

While looking for information I did come across this similar post from 6 years ago: search-xmls-which-do-not-have-particular-element-in-marklogic

But a fair amount of time has passed since that was asked, it's tagged for marklogic-8, and my question differs to a good degree since I'm hoping to achieve this with the out-of-the-box search module/API.


Solution

  • I eventually found an answer to a similar question: https://stackoverflow.com/a/73504631/2292130

    With the element position index enabled, you should be able to use cts:not-in-query. Rule in all the hits for parent, and then rule out all the hits for parent/child1.

    
    cts:not-in-query(
      cts:element-query(xs:QName('parent'), cts:true-query()),
      cts:element-query(xs:QName('parent'),
        cts:element-query(xs:QName('child1'), cts:true-query())
      )
    )
    

    This works because, unlike cts:and-not-query for example, cts:not-in-query takes the positions of the matches within the document into account before returning results.

    The documentation for cts:not-in-query describes it as follows: "Returns a query matching the first sub-query, where those matches do not occur within 0 distance of the other query."

    Your example document will have three matches for the positive query (that is to say, parent nodes), and two hits for the negative query (child1 nodes in parent nodes). But those two negative-query matches are at the same position (0 positions removed from) two of the positive-query matches. Those two overlaps get excluded, but the document has one positive match remaining, which therefore counts as a match for the cts:not-in-query as a whole.
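
    Since the question asks for something usable through the search module or the REST search endpoint, here is a minimal sketch of carrying the same query in search options. It is not a confirmed recipe. It assumes the element position index mentioned above is enabled, and it relies on the fact that a cts:query value placed inside an element constructor is serialized to its XML form. The variable names ($parent, $query, $options) are illustrative only.

    ```xquery
    xquery version "1.0-ml";
    import module namespace search = "http://marklogic.com/appservices/search"
      at "/MarkLogic/appservices/search/search.xqy";

    (: Assumption: the position index required by cts:not-in-query is enabled. :)
    let $parent := xs:QName("parent")

    (: Positive matches: every <parent>.
       Negative matches: every <parent> that contains a <child1>. :)
    let $query :=
      cts:not-in-query(
        cts:element-query($parent, cts:true-query()),
        cts:element-query($parent,
          cts:element-query(xs:QName("child1"), cts:true-query())
        )
      )

    (: Embedding the cts:query in the constructor serializes it to XML,
       which is the form that <additional-query> expects. :)
    let $options :=
      <options xmlns="http://marklogic.com/appservices/search">
        <additional-query>{$query}</additional-query>
      </options>

    return search:search("", $options)
    ```

    Returning $options instead of calling search:search shows the serialized query element, which could then be copied into options supplied to the REST API. That last step is untested here and depends on how your endpoint is configured.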