marklogic

Is it possible to resolve a punctuation-sensitive search from the index?


I have a search application where, due to the nature of the documents, users frequently include (relevant) punctuation in their search terms. This often leads to result estimates being quite different to the actual, filtered, result count.
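To illustrate the mismatch (a minimal sketch; "foo$bar" is just a placeholder term): the unfiltered index estimate counts any document where the word fragments appear in sequence, while a filtered search verifies the punctuation:

    (: Unfiltered estimate, resolved purely from the universal index;
       punctuation is not in the word term lists, so this also counts
       documents containing "foo bar", "foo-bar", and so on. :)
    xdmp:estimate(cts:search(fn:doc(), cts:word-query("foo$bar"))),
    (: The filtered search re-inspects each candidate document and keeps
       only true punctuation-sensitive matches, so this count can be
       much lower. :)
    fn:count(cts:search(fn:doc(), cts:word-query("foo$bar"), "filtered"))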

What I'd like to do, given I know the nature of the searches I'm going to be running, is configure the universal index to reflect that. In this case, I never want to run a punctuation-insensitive search, so it seems like configuring ML to include punctuation characters as "word characters" for the purposes of building its term lists would make the estimates match the actual results much more closely.

I haven't been able to find any way of configuring ML to build the universal index that way (I assumed there'd be a "fast punctuation sensitive searches" option). I even tried creating a word lexicon with a punctuation-sensitive collation, in the hope that ML would use it as a hint for its term list generation, but no dice.

In an ideal world I'd be able to configure two term lists: one punctuation-sensitive and one not. For the purposes of this question, though, just picking between the two would be sufficient.

Is anything like this possible?


Solution

The universal index does index punctuation, but only for node values, not for words. The term lists for word queries do not include punctuation, because the tokenizer defines words as strings that contain neither whitespace nor punctuation. The docs at http://docs.marklogic.com/guide/search-dev/languages discuss tokenization, and http://docs.marklogic.com/guide/search-dev/custom-dictionaries describes how to modify that behavior using custom tokenization and stemming dictionaries. However, for most languages that feature still does not allow words to include punctuation.
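You can see the tokenizer at work with cts:tokenize; a quick sketch (the "en" language argument is optional):

    (: Punctuation becomes its own token class, so "foo$bar" can never
       be a single word term in the universal index. :)
    for $tok in cts:tokenize("foo$bar", "en")
    return
      if ($tok instance of cts:word) then "word: " || $tok
      else if ($tok instance of cts:punctuation) then "punct: " || $tok
      else "space: " || $tok
    (: => ("word: foo", "punct: $", "word: bar") :)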

So what can you do? It would help to know more about the application domain, to understand exactly why the searches are so sensitive to punctuation. Lacking that detail, I think the answer will be to somehow turn word terms into value terms. That might involve some combination of content enrichment and query expansion using punctuation-sensitive range indexes.

For content enrichment, could you mark up the punctuation-sensitive words and phrases? This could work particularly well if the crucial terms are something like code groups: for example, foo$bar amid other text. By marking that up as <psv>foo$bar</psv> you might be able to detect foo$bar in a query and then use a punctuation-sensitive cts:element-value-query instead of a word query.
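As a sketch, assuming the enrichment above has wrapped the sensitive terms in psv elements, the "exact" option makes the value query punctuation-, case-, and whitespace-sensitive and unstemmed:

    (: Matches <psv>foo$bar</psv> exactly; node values are indexed
       with their punctuation, so this resolves from the index. :)
    cts:search(
      fn:doc(),
      cts:element-value-query(xs:QName("psv"), "foo$bar", "exact"))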

Given that extra markup you could also create a range index on psv using a punctuation-sensitive collation. Then a range-index constraint would map psv:"foo$bar" to a range query term on that index.
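For instance, assuming an element range index on psv with the codepoint collation (which preserves punctuation) has been added to the database, Search API options along these lines would do that mapping:

    <options xmlns="http://marklogic.com/appservices/search">
      <constraint name="psv">
        <range type="xs:string"
               collation="http://marklogic.com/collation/codepoint">
          <element ns="" name="psv"/>
        </range>
      </constraint>
    </options>

With those options in hand, search:search('psv:"foo$bar"', $options) resolves the constrained term as a cts:element-range-query against the index rather than a word query.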

Another use for a range index would be query-time expansion: turn each punctuation-sensitive word term into an OR of all possible value terms. This would work best if the indexed element contains relatively few values. This approach requires some extra work in the application code, which would have to ensure that the right query terms use the right range index. That could be done as a post-processing step after search:parse, or in a custom parser like xqysp. The core idea is to identify the user input terms that need expansion, then replace what would be a cts:word-query term with a cts:element-range-query term, using values from a cts:element-value-match lookup.
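A sketch of that expansion, assuming the psv range index above; local:expand-term is a hypothetical helper, and the wildcard pattern is just one way to find candidate values in the lexicon:

    xquery version "1.0-ml";

    declare variable $COLL :=
      "collation=http://marklogic.com/collation/codepoint";

    (: Hypothetical helper: expand one user term into an OR of exact
       value terms drawn from the psv value lexicon; fall back to a
       plain word query when the lexicon has no matches. :)
    declare function local:expand-term($term as xs:string) as cts:query
    {
      let $values := cts:element-value-match(
        xs:QName("psv"), "*" || $term || "*", $COLL)
      return
        if (fn:exists($values))
        then cts:or-query(
          for $v in $values
          return cts:element-range-query(
            xs:QName("psv"), "=", $v, $COLL))
        else cts:word-query($term)
    };

    (: e.g. "foo" might expand to match foo$bar, foo$baz, ... :)
    local:expand-term("foo")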