I get many false positives from an unfiltered cts:search that uses cts:element-value-query with the "whitespace-sensitive" and "wildcarded" options, and I would like to understand why. (A filtered cts:search seems to work fine, and the unfiltered search also seems to work if I remove "whitespace-sensitive".)
Here is the target element value I want to find: 'ABC/DE 123/FG HI12'
xquery version "1.0-ml";
(:Case 1:)
let $term1 := 'ABC/DE 123'
(:Case 2:)
let $term2 := 'ABC/DE 123*'
(:Case 3:)
let $term3 := 'ABC/DE 123 *'
(:Case 4:)
let $term4 := 'ABC/DE 123* '
(:Case 5:)
let $term5 := 'ABC/DE 123* *'
let $queries := cts:and-query((
  cts:element-value-query(fn:QName("","name"), $term2,
    ("case-insensitive","diacritic-sensitive","punctuation-sensitive","whitespace-sensitive","wildcarded","lang=en"), 1),
  cts:collection-query("http://marklogic.com/collections/dls/latest-version")
))
return
  xdmp:plan(cts:search(fn:doc(), $queries, 'unfiltered'))
Here are the query plans for case 2 and case 3. My question is why they are different.
The plans are different because you have the wildcard in different places, and with cts:element-value-query wildcard matches do not span word boundaries.
With 'ABC/DE 123*', the wildcard is for the "word" starting with 123.
With 'ABC/DE 123 *', it will look for 123, and then * matches everything after it.
There is a note in the cts:element-value-query documentation about wildcards that explains:
- Note that the text content for the value in a cts:element-value-query is treated the same as a phrase in a cts:word-query, where the phrase is the element value. Therefore, any wildcard and/or stemming rules are treated like a phrase. For example, if you have an element value of "hello friend" with wildcarding enabled for a query, a cts:element-value-query for "he*" will not match because the wildcard matches do not span word boundaries, but a cts:element-value-query for "hello *" will match. A search for "*" will match, because a "*" wildcard by itself is defined to match the value. Similarly, stemming rules are applied to each term, so a search for "hello friends" would match when stemming is enabled for the query because "friends" matches "friend".
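The three cases in that note can be tried directly. This is a minimal sketch, assuming an in-memory document built only for illustration (the wrapper element doc is hypothetical, and the element name is borrowed from the question). It uses cts:contains, which evaluates the query against the node itself, so it shows the matching rules rather than index resolution.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory document, used only to illustrate the note above. :)
let $doc := <doc><name>hello friend</name></doc>
(: The three terms discussed in the note. :)
for $term in ("he*", "hello *", "*")
return
  <result term="{$term}">{
    (: Per the note, "he*" should not match, while "hello *" and "*" should. :)
    cts:contains($doc, cts:element-value-query(xs:QName("name"), $term, "wildcarded"))
  }</result>
```

Because cts:contains does not consult the indexes, the results here correspond to what a filtered search would report, not to what an unfiltered search returns.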
Also, some of the relevant bullet points from the Rules for Wildcard Searches:
- Spaces are used as word breaks, and wildcard matching only works within a single word. For example, m*th* will match method but not meet there.
- If the * wildcard is specified with a non-wildcard character, it will match in value lexicon queries (for example, cts:element-value-match), but will not match in value queries (for example, cts:element-value-query). For example, m* will match the value meet me there for a value lexicon search (for example, cts:element-value-match) but will not match the value for a value query search (for example, cts:element-value-query), because the value query only matches the one word. A value search for m* * will match the value (because m* matches the first word and * matches everything after it).
- If the query has the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
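To see where the false positives in the question come from, it may help to compare what the indexes alone report with what survives filtering, for each wildcard placement and each whitespace option. The following is a rough sketch, not a definitive diagnostic: it reuses the element name and two of the terms from the question, drops the collection query for brevity, and wraps the numbers in a hypothetical comparison element.

```xquery
xquery version "1.0-ml";

(: Options from the question, minus the whitespace option, which is varied below. :)
let $options := ("case-insensitive", "diacritic-sensitive",
                 "punctuation-sensitive", "wildcarded", "lang=en")
(: Case 2 and case 3 from the question. :)
for $term in ("ABC/DE 123*", "ABC/DE 123 *")
for $ws in ("whitespace-sensitive", "whitespace-insensitive")
let $query :=
  cts:element-value-query(fn:QName("", "name"), $term, ($options, $ws), 1)
return
  <comparison term="{$term}" whitespace="{$ws}">
    <!-- Index-only estimate: what an unfiltered search would return. -->
    <unfiltered-estimate>{xdmp:estimate(cts:search(fn:doc(), $query))}</unfiltered-estimate>
    <!-- Matches remaining after each candidate document is filtered. -->
    <filtered-count>{fn:count(cts:search(fn:doc(), $query, "filtered"))}</filtered-count>
  </comparison>
```

Where the estimate exceeds the filtered count, the difference is the number of false positives from index resolution for that combination of term and options.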