marklogic

Sounds like searching in marklogic


Does MarkLogic have an option for performing a search that looks for words in documents that sound like the terms in the query text?

I couldn't find anything on it, so I tried making my own with spell:levenshtein-distance in combination with cts:tokenize and cts:words. You can view it here: https://github.com/freshie/ml-levenshtein-search/blob/master/levenshtein-distance.xqy

That doesn't really give me what I want, though; it just gives me words that are spelled similarly. Any thoughts on how to do a sounds-like search?


Solution

  • It's easiest to first use cts:words() to build a dictionary of the words that occur in the corpus, then use spell:suggest-detailed() to find similar matches for the query text, limited to some maximum distance. The spell expansion algorithm is based on double metaphone, which suits this better than Levenshtein because it's phonetic, and you want sounds-like, not spelled-like. I've found that a distance threshold of 25 gives a decent level of fuzz.

    In advance:

    spell:insert("dictionary.xml", spell:make-dictionary($word-sequence))
    

    Then (in 0.9-ml dialect):

    define function expand-spell($word as xs:string)
      as xs:string*
    {
      let $threshold := 25
      let $options := <options xmlns="http://marklogic.com/xdmp/spell">
                        <distance-threshold>{ $threshold }</distance-threshold>
                        <maximum>20</maximum>
                      </options>
      for $suggest in spell:suggest-detailed("dictionary.xml", $word, $options)//spell:word
      order by $suggest/@word-distance
      return string($suggest)
    }
    

    I ordered by distance so that, when I displayed the expansion in a demo, closer matches appeared higher in the list.
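
For readers on the 1.0-ml dialect, here is a rough sketch of how the two steps above might fit together and feed an actual search. It is not the answerer's code. It assumes a word lexicon is enabled on the database (cts:words() needs one) and reuses the "dictionary.xml" URI from the answer. The local:expand-spell name, the sample query text, and the choice to AND the per-word queries together are illustrative assumptions. The first module builds the dictionary and would be run once, in its own request:

```xquery
xquery version "1.0-ml";
import module namespace spell = "http://marklogic.com/xdmp/spell"
  at "/MarkLogic/spell.xqy";

(: Assumption: a word lexicon is enabled, which cts:words() requires. :)
spell:insert("dictionary.xml", spell:make-dictionary(cts:words()))
```

The second module expands each word of the query text and searches for documents matching any variant of every word:

```xquery
xquery version "1.0-ml";
import module namespace spell = "http://marklogic.com/xdmp/spell"
  at "/MarkLogic/spell.xqy";

(: 1.0-ml rendering of the answer's expand-spell function. :)
declare function local:expand-spell($word as xs:string) as xs:string*
{
  let $options :=
    <options xmlns="http://marklogic.com/xdmp/spell">
      <distance-threshold>25</distance-threshold>
      <maximum>20</maximum>
    </options>
  for $suggest in spell:suggest-detailed("dictionary.xml", $word, $options)//spell:word
  (: Cast so that distances sort numerically rather than as strings. :)
  order by xs:integer($suggest/@word-distance)
  return fn:string($suggest)
};

(: Hypothetical query text, used only for illustration. :)
let $query-text := "nite flite"
let $word-queries :=
  for $token in cts:tokenize($query-text)
  where $token instance of cts:word
  return
    (: Match the original word or any of its sounds-like expansions. :)
    cts:word-query((fn:string($token), local:expand-spell(fn:string($token))))
return cts:search(fn:doc(), cts:and-query($word-queries))
```

Each query word becomes one cts:word-query over the word plus its expansions, so a document matches a word if it contains any of the variants. Whether to combine the per-word queries with cts:and-query or cts:or-query depends on how strict the search should be.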