Unfiltered wildcard cts:element-value-query gives false results


I get many false results from cts:element-value-query when I run an unfiltered search with the "whitespace-sensitive" and "wildcarded" options, and I would like an explanation. (A filtered cts:search seems to work fine, and the unfiltered search also seems to work if I remove "whitespace-sensitive".)

Here is the target element value I want to find: 'ABC/DE 123/FG HI12'

xquery version "1.0-ml";

(:Case 1:)
let $term1 := 'ABC/DE 123'

(:Case 2:)
let $term2 := 'ABC/DE 123*'

(:Case 3:)
let $term3 := 'ABC/DE 123 *'

(:Case 4:)
let $term4 := 'ABC/DE 123* '

(:Case 5:)
let $term5 := 'ABC/DE 123* *'


let $queries := cts:and-query((
  cts:element-value-query(
    fn:QName("", "name"),
    $term2,
    ("case-insensitive", "diacritic-sensitive", "punctuation-sensitive",
     "whitespace-sensitive", "wildcarded", "lang=en"),
    1),
  cts:collection-query("http://marklogic.com/collections/dls/latest-version")
))


return 
xdmp:plan(cts:search(fn:doc(), $queries, 'unfiltered'))

Here are the query plans for case 2 and case 3. My question is why they are different.

case2

case3


Solution

  • The plans are different because you have the wildcard in different places, and with cts:element-value-query wildcard matches do not span word boundaries.

    • In case 2 (ABC/DE 123*), the wildcard applies only to the "word" that starts with 123.
    • In case 3 (ABC/DE 123 *), the query looks for the word 123, and the standalone * then matches everything after it.
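
    One way to see the effect of each wildcard placement is to compare unfiltered and filtered result counts for all five terms from the question. This is only a sketch: it reuses the element name and options from the question (the collection query is left out for brevity), and the counts depend entirely on your data and index settings.

    ```xquery
    xquery version "1.0-ml";

    (: Sketch: compare index-only (unfiltered) and filtered counts per term.
       The element name and options are copied from the question. :)
    let $options := ("case-insensitive", "diacritic-sensitive", "punctuation-sensitive",
                     "whitespace-sensitive", "wildcarded", "lang=en")
    for $term in ("ABC/DE 123", "ABC/DE 123*", "ABC/DE 123 *", "ABC/DE 123* ", "ABC/DE 123* *")
    let $query := cts:element-value-query(fn:QName("", "name"), $term, $options)
    return fn:concat(
      "'", $term, "'",
      " unfiltered=", fn:count(cts:search(fn:doc(), $query, "unfiltered")),
      " filtered=", fn:count(cts:search(fn:doc(), $query, "filtered"))
    )
    ```

    A term whose two counts differ is one where index resolution alone matches more than the filtered search does.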

    A note in the cts:element-value-query documentation explains the wildcard behavior:

    • Note that the text content for the value in a cts:element-value-query is treated the same as a phrase in a cts:word-query, where the phrase is the element value. Therefore, any wildcard and/or stemming rules are treated like a phrase. For example, if you have an element value of "hello friend" with wildcarding enabled for a query, a cts:element-value-query for "he*" will not match because the wildcard matches do not span word boundaries, but a cts:element-value-query for "hello *" will match. A search for "*" will match, because a "*" wildcard by itself is defined to match the value. Similarly, stemming rules are applied to each term, so a search for "hello friends" would match when stemming is enabled for the query because "friends" matches "friend".
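
    The "hello friend" example from that note can be tried directly with cts:contains, which evaluates a query against an in-memory node. This is a minimal sketch: the doc and greeting element names are made up for illustration. Going by the note, one would expect only the first pattern to fail.

    ```xquery
    xquery version "1.0-ml";

    (: Hypothetical in-memory node; the element names are illustrative only. :)
    let $node := <doc><greeting>hello friend</greeting></doc>
    for $pattern in ("he*", "hello *", "*")
    let $query := cts:element-value-query(fn:QName("", "greeting"), $pattern, ("wildcarded"))
    (: cts:contains returns true when the node satisfies the query. :)
    return fn:concat($pattern, " -> ", cts:contains($node, $query))
    ```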

    Also, here are some of the relevant bullet points from the Rules for Wildcard Searches:

    • Spaces are used as word breaks, and wildcard matching only works within a single word. For example, m*th* will match method but not meet there.
    • If the * wildcard is specified with a non-wildcard character, it will match in value lexicon queries (for example, cts:element-value-match), but will not match in value queries (for example, cts:element-value-query). For example, m* will match the value meet me there for a value lexicon search (for example, cts:element-value-match) but will not match the value for a value query search (for example, cts:element-value-query), because the value query only matches the one word. A value search for m* * will match the value (because m* matches the first word and * matches everything after it).
    • If the query has the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
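
    The difference between a value lexicon search and a value query can be sketched as follows. This assumes a string element range index is configured on name (cts:element-value-match requires one, and the question does not say whether it exists), and whether either call returns anything depends on your data.

    ```xquery
    xquery version "1.0-ml";

    (: Assumption: a string element range index exists on "name";
       cts:element-value-match needs it, cts:element-value-query does not. :)
    let $name := fn:QName("", "name")
    return (
      (: Value lexicon: the pattern is applied to the whole value. :)
      cts:element-value-match($name, "ABC/DE 123*"),

      (: Value query: each wildcard stays within one word, so a trailing
         standalone * is used to cover everything after that word. :)
      cts:search(fn:doc(),
        cts:element-value-query($name, "ABC/DE 123 *", ("wildcarded")))
    )
    ```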