I have my database language set to en, stemmed searches set to Basic, and word searches disabled.
For a document like the following, I expected queries to match only on the first/shortest stem (as described here). The word further returns 3 stems: further, farther and far. I checked this with
cts:stem("further")
Since Basic stemmed searches should only index the shortest stem, I expected a search for farther not to find my document. But it does.
xquery version "1.0-ml";
let $doc :=
<doc>
<title>further</title>
</doc>
return xdmp:document-insert('test.xml', $doc);
cts:search(doc(), cts:word-query("farther")); (: finds my document :)
cts:stem("further")
Is there anything I am misunderstanding? Why does a search for farther find a doc containing further, even though it is not the shortest/first stem?
A search for the third stem also finds my document, even with the "unstemmed" option (word searches enabled in this case).
cts:search(doc(), cts:word-query("further", ("unstemmed")));
Using MarkLogic 9.0-7.2.
The universal index in MarkLogic has multiple parts. There is one for stemmed searches, and one for unstemmed/wildcarded searches. The stemmed part of the index contains stems, but the unstemmed part has the unstemmed tokens. That is why the unstemmed search on the actual value finds a match.
About the stemmed search: as you can read in the documentation of cts:stem, that function returns all stems regardless of the database setting. However, the order in which it returns them is important. cts:stem("further") returns far, further, farther; cts:stem("farther") returns far, farther, further; and cts:stem("far") returns far.
As I understand it, basic stemming takes the first item returned by cts:stem and uses that for indexing. Given the results above, that means it uses far for further and farther, as well as for far itself. Advanced stemming would allow you to find further when doing a stemmed search for farther, and vice versa.
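If you want to check this on your own database, something like the following might help. It is only a sketch: it assumes the test document from the question is already inserted, that "en" is the right language argument for your setup, and that stemmed searches are enabled. It lists the stems cts:stem returns for each word, in order, next to the number of documents an explicitly stemmed word query finds.

```xquery
xquery version "1.0-ml";

(: For each word, show the ordered stems and how many documents a
   stemmed word query matches. The "en" language argument is an
   assumption based on the database language in the question. :)
for $word in ("far", "farther", "further")
let $stems := cts:stem($word, "en")
let $hits := fn:count(cts:search(fn:doc(), cts:word-query($word, "stemmed")))
return fn:concat(
  $word,
  " -> stems: ", fn:string-join($stems, ", "),
  " | stemmed matches: ", $hits
)
```

If the first stem is the same for all three words, the same match count for all three would be consistent with the explanation above.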
Some more detail is available in the Search Developer's Guide, in the section 'Stemming in MarkLogic'.
HTH!