
MarkLogic stemming basic


My database language is set to en and stemmed searches to Basic, with word searches disabled.

For a document like the following, I expected queries to match only on the first/shortest stem (as described here). cts:stem returns three stems for further: further, farther and far. I checked this with

cts:stem("further")

Since Basic stemmed searches should index only the shortest stem, I expected a search for farther not to find my document. But this is not the case.

xquery version "1.0-ml";

let $doc := 
<doc>
  <title>further</title>
</doc>

return xdmp:document-insert('test.xml', $doc);

cts:search(doc(), cts:word-query("farther")); (: finds my document :)

cts:stem("further")

Is there anything I am misunderstanding? Why does a search for farther find a document containing further, even though it is not the shortest/first stem?

Also, a search for the third stem finds my document, even with the "unstemmed" option (word searches enabled in this case).

cts:search(doc(), cts:word-query("further", ("unstemmed")));

Using MarkLogic 9.0-7.2.


Solution

  • The universal index in MarkLogic has multiple parts. There is one for stemmed searches, and one for unstemmed/wildcarded searches. The stemmed part of the index contains stems, but the unstemmed part has the unstemmed tokens. That is why the unstemmed search on the actual value finds a match.
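
    To see the two parts of the index side by side, the same document can be queried with the "stemmed" and "unstemmed" options of cts:word-query. This is only a minimal sketch: it assumes the test.xml document from the question is loaded, and that both stemmed searches and word searches are enabled on the database (the "unstemmed" option relies on word searches).

    ```xquery
    xquery version "1.0-ml";

    (: Assumes test.xml from the question, whose only word is "further". :)
    (
      (: Stemmed lookup: compares stems, so "farther" may match "further". :)
      fn:count(cts:search(fn:doc("test.xml"), cts:word-query("farther", "stemmed"))),
      (: Unstemmed lookup: compares literal tokens, so "farther" should not match. :)
      fn:count(cts:search(fn:doc("test.xml"), cts:word-query("farther", "unstemmed"))),
      (: Unstemmed lookup on the literal token that is actually in the document. :)
      fn:count(cts:search(fn:doc("test.xml"), cts:word-query("further", "unstemmed")))
    )
    ```

    If the explanation above holds, the first and third counts should be non-zero and the second should be zero.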

    About the stemmed search: as the documentation of cts:stem states, that function returns all stems regardless of the database setting. However, the order in which it returns them matters: cts:stem("further") returns far, further, farther; cts:stem("farther") returns far, farther, further; and cts:stem("far") returns far.

    From what I understand, basic stemming takes the first item returned by cts:stem and uses that for indexing. Given the results above, that means it uses far for further and farther as well as for far, so all three words share the same indexed stem and match each other in a stemmed search. Advanced stemming would index the additional stems too, which would also let you find further with a stemmed search for farther, and vice versa.
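
    One way to check this reasoning is to compare only the first stem of each word, since that is the one basic stemming appears to use. A small sketch, assuming the database language is en as in the question:

    ```xquery
    xquery version "1.0-ml";

    (: Print each word next to the first stem that cts:stem reports for it. :)
    for $word in ("further", "farther", "far")
    return fn:concat($word, " -> ", cts:stem($word, "en")[1])
    ```

    If the three words report the same first stem, a basic stemmed search for any one of them would be expected to match documents containing the others.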

    Some more detail is available in the Search guide in the section: 'Stemming in MarkLogic'

    HTH!