java, arangodb, arangojs

How do I run a fulltext search when the string contains '-', e.g. "3da549f0-0e88-4297-b6af-5179b74bd929"?


When I index the field and search for a string that contains '-', like the example above, ArangoDB treats the '-' as a negation operator and does not search for that string. How can I search for documents that contain '-'?


Solution

  • I tried to reproduce what you did. My answer could probably be more accurate if you provided a reproducible example (using arangosh only) of what you're currently trying:

    http+tcp://127.0.0.1:8529@_system> db._create("testIndex")
    http+tcp://127.0.0.1:8529@_system> db.testIndex.ensureIndex({type: "fulltext", fields: ["complete:3da549f0-0e88-4297-b6af-5179b74bd929"]})
    { 
      "fields" : [ 
        "complete:3da549f0-0e88-4297-b6af-5179b74bd929" 
      ], 
      "id" : "testIndex/4687162", 
      "minLength" : 2, 
      "sparse" : true, 
      "type" : "fulltext", 
      "unique" : false, 
      "isNewlyCreated" : true, 
      "code" : 201 
    }
    
    http+tcp://127.0.0.1:8529@_system> db.testIndex.save({'complete:3da549f0-0e88-4297-b6af-5179b74bd929': "find me"})
    { 
      "_id" : "testIndex/4687201", 
      "_key" : "4687201", 
      "_rev" : "4687201" 
    }
    
    http+tcp://127.0.0.1:8529@_system> db._query('FOR doc IN FULLTEXT(testIndex, "complete:3da549f0-0e88-4297-b6af-5179b74bd929", "find") RETURN doc')
    [object ArangoQueryCursor, count: 1, hasMore: false]
    
    
    [ 
      { 
        "_id" : "testIndex/4687201", 
        "_key" : "4687201", 
        "_rev" : "4687201", 
        "complete:3da549f0-0e88-4297-b6af-5179b74bd929" : "find me" 
      } 
    ]
    

    That works, so your use case presumably looks different, more like this:

    db.test2.save({id: 'complete:3da549f0-0e88-4297-b6af-5179b74bd929'})
    db.test2.ensureIndex({type: "fulltext", fields: ["id"]})
    
    db._query('FOR doc IN FULLTEXT(test2, "id", "3da549f0-0e88-4297-b6af-5179b74bd929") RETURN doc')
    

    which will return an empty result.

    To understand what's going on, you need to know how the fulltext index works. It splits each text at word boundaries and stores the resulting words in a list, each with a reference to the document it came from. A single word in that index-global wordlist may reference several documents.

    When the index is queried, the requested words are looked up in the index-global wordlist, and each word that is found yields a list of the documents containing it. These buckets are combined and returned as one list of documents to iterate over.
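The wordlist idea described above can be sketched in a few lines of plain JavaScript. This is only an illustration, not ArangoDB's implementation: `tokenize`, `buildIndex` and `search` are hypothetical helpers, the regex split is a rough stand-in for the real ICU tokenizer, and combining buckets by intersection (every word must match) is an assumption.

```javascript
// Rough stand-in for the real tokenizer (an assumption): split on anything
// that is not a letter or digit, lowercase, and drop words that are too short.
function tokenize(text, minLength = 2) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((word) => word.length >= minLength);
}

// Build the index-global wordlist: word -> set of document keys.
function buildIndex(docs, field) {
  const wordlist = new Map();
  for (const doc of docs) {
    if (typeof doc[field] !== "string") continue; // skip docs without the field
    for (const word of tokenize(doc[field])) {
      if (!wordlist.has(word)) wordlist.set(word, new Set());
      wordlist.get(word).add(doc._key);
    }
  }
  return wordlist;
}

// Look up each requested word and intersect the resulting buckets.
function search(wordlist, words) {
  let result = null;
  for (const word of words) {
    const bucket = wordlist.get(word.toLowerCase()) || new Set();
    result = result === null
      ? new Set(bucket)
      : new Set([...result].filter((key) => bucket.has(key)));
  }
  return result === null ? [] : [...result].sort();
}

const docs = [
  { _key: "1", text: "find me" },
  { _key: "2", text: "find you" },
];
const wordlist = buildIndex(docs, "text");

console.log(search(wordlist, ["find"]));       // both documents contain "find"
console.log(search(wordlist, ["find", "me"])); // only document 1 contains both words
```

Note that a lookup can only ever succeed for strings that exist as whole entries in the wordlist, which is why the tokenizer's idea of a "word" matters so much.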

    To make the tokenizer a little easier to understand, I've added a tiny JavaScript wrapper that invokes it.

    Let's have a look at what it does to your string:

    SYS_SPLIT_WORDS_ICU("ab cd", 0)
    [ 
      "ab", 
      " ", 
      "cd" 
    ]
    SYS_SPLIT_WORDS_ICU("3da549f0-0e88-4297-b6af-5179b74bd929", 0)
    [ 
      "3da549f0", 
      "-", 
      "0e88", 
      "-", 
      "4297", 
      "-", 
      "b6af", 
      "-", 
      "5179b74bd929" 
    ]
    

    As you can see, minus signs are treated as word boundaries, so your string is split into several parts. There are several ways to work around this:

    • remove the minus signs on insert
    • split the search string and search for the most meaningful part of the hash, then add a FILTER statement that compares against the actual value
    • don't use the fulltext index for this at all, but rather a skiplist or a hash index; they're cheaper to maintain and can be used for FILTER statements
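The second and third options could look roughly like the sketch below. It only builds the query strings and bind variables, so it runs without a server; I have not tested the queries against one. `buildFulltextLookup` and `buildHashLookup` are hypothetical helper names, the collection `test2` and attribute `id` come from the example above, and treating the longest part as the "most meaningful" one is my assumption.

```javascript
// Option 2: search the fulltext index for one part of the value,
// then let FILTER compare against the complete stored value.
function buildFulltextLookup(id) {
  // Split at the same kind of boundaries the tokenizer showed above.
  const parts = id.split(/[^0-9A-Za-z]+/).filter((part) => part.length > 0);
  // Assumption: the longest part is the most selective one.
  const word = parts.reduce(
    (best, part) => (part.length > best.length ? part : best),
    ""
  );
  return {
    query:
      'FOR doc IN FULLTEXT(test2, "id", @word) FILTER doc.id == @id RETURN doc',
    bindVars: { word: word.toLowerCase(), id: id },
  };
}

// Option 3: skip the fulltext index and compare the attribute directly.
// This assumes a hash (or skiplist) index was created beforehand, e.g. with
//   db.test2.ensureIndex({type: "hash", fields: ["id"]})
function buildHashLookup(id) {
  return {
    query: "FOR doc IN test2 FILTER doc.id == @id RETURN doc",
    bindVars: { id: id },
  };
}

const storedId = "complete:3da549f0-0e88-4297-b6af-5179b74bd929";
const fulltextLookup = buildFulltextLookup(storedId);
const hashLookup = buildHashLookup(storedId);

console.log(fulltextLookup.bindVars.word); // the part sent to the fulltext index
console.log(hashLookup.query);
```

In arangosh you would then pass the pieces to `db._query(fulltextLookup.query, fulltextLookup.bindVars)`. For exact matches on an identifier like this one, the hash index variant is the simpler choice.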