
Mongo $text query: return docs "starting with" string before others


Let's say I have a mongo collection with a text index on the itemName field with these 3 documents:

{
    _id: ...,
    itemName: 'Mashed carrots with big carrot pieces',
    price: 1.29
},
{
    _id: ...,
    itemName: 'Carrot juice',
    price: 0.79
},
{
    _id: ...,
    itemName: 'Apple juice',
    price: 1.49
}

I then run a query like this:

db.items.find({ $text: { $search: 'Car' } }, { score: { $meta: "textScore" } }).sort( { score: { $meta: "textScore" } } );

How do I force mongo to return documents beginning with "Car" (case insensitive) before returning any other docs also containing "Car" somewhere in the itemName string?

So I want to retrieve the docs in the following order:

[
    {..., itemName: 'Carrot juice', ...},
    {..., itemName: 'Mashed carrots with big carrot pieces', ...}
]

This is meant for a search feature, so it makes sense to show the user the items that start with their search string before any other matches.

Until now I was using a plain regex, but its performance is much worse. And since the search has to be case insensitive, a regex apparently cannot make effective use of an index at all, according to the docs.

EDIT:

Also, sometimes the behavior of $text is very weird. For example I have about 10-15 items where itemName begins with the word "Zwiebel". This query

db.items.find({ $text: { $search: "Zwiebel" }, supplier_id: 'iNTJHEf5YgBPicTrJ' }, { score: { $meta: "textScore" } }).sort( { score: { $meta: "textScore" } } );

works like a charm and returns all those documents, while this query

db.items.find({ $text: { $search: "Zwie" }, supplier_id: 'iNTJHEf5YgBPicTrJ' }, { score: { $meta: "textScore" } }).sort( { score: { $meta: "textScore" } } );

does not return anything! The only change is "Zwiebel" to "Zwie" in the $search.

I really do not understand how this is possible?!

best, P


Solution

  • A solution is to use the $indexOfCP operator, introduced in MongoDB 3.4.

    This operator returns the index of the first occurrence of a string within another string, or -1 if there is no occurrence.
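
To make the semantics concrete, here is a rough plain-JavaScript analogue of what $indexOfCP computes. The function name indexOfCP is only an illustration and not part of any driver API; it compares code points (not UTF-16 units), which is my understanding of how the operator counts positions.

```javascript
// Illustrative analogue of $indexOfCP: index of the first occurrence of
// `needle` in `haystack`, counted in code points, or -1 if absent.
function indexOfCP(haystack, needle) {
  const h = Array.from(haystack); // Array.from splits a string by code point
  const n = Array.from(needle);
  for (let i = 0; i + n.length <= h.length; i++) {
    if (n.every((cp, j) => h[i + j] === cp)) return i;
  }
  return -1;
}

console.log(indexOfCP("carrot juice", "car"));
console.log(indexOfCP("mashed carrots with big carrot pieces", "car"));
console.log(indexOfCP("apple juice", "car"));
```

The smaller the index, the earlier the match, so an item that starts with the search string gets index 0 and sorts first.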

    How it works:

    1. filter out all documents not containing 'car' with a case-insensitive regex: /car/i
    2. create a field named index which stores the position of 'car' in the lowercased itemName
    3. sort the documents on the index field

    The query looks like this:

    db.items.aggregate([
       {
          $match:{
             itemName:/car/i
          }
       },
       {
          $project:{
             index:{
                $indexOfCP:[
                   {
                      $toLower:"$itemName"
                   },
                   "car"
                ]
             },
             price:1,
             itemName:1
          }
       },
       {
          $sort:{
             index:1
          }
       }
    ])
    

    and this returns:

    { "_id" : 2, "itemName" : "Carrot juice", "price" : 0.79, "index" : 0 }
    { "_id" : 1, "itemName" : "Mashed carrots with big carrot pieces", "price" : 1.29, "index" : 7 }
    

    try it online: mongoplayground.net/p/FqqCUQI3D-E
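
For a real search box the term comes from the user, so the pipeline above would have to be built dynamically. The following is only a sketch under assumptions: escapeRegex and buildPrefixFirstPipeline are hypothetical helper names, and the input is assumed to be ASCII, since $toLower and JavaScript's toLowerCase() may disagree on other characters.

```javascript
// Hypothetical helper: escape regex metacharacters so user input is
// matched literally instead of being interpreted as a pattern.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build the same three stages as the query above
// for an arbitrary search term.
function buildPrefixFirstPipeline(term) {
  const needle = term.toLowerCase(); // assumption: ASCII input
  return [
    // 1. keep only documents containing the term, case-insensitively
    { $match: { itemName: { $regex: escapeRegex(term), $options: "i" } } },
    // 2. compute the position of the term in the lowercased name
    {
      $project: {
        index: { $indexOfCP: [{ $toLower: "$itemName" }, needle] },
        price: 1,
        itemName: 1
      }
    },
    // 3. earliest match first, so names starting with the term come first
    { $sort: { index: 1 } }
  ];
}

console.log(JSON.stringify(buildPrefixFirstPipeline("Car"), null, 2));
```

In the shell it could then be used as db.items.aggregate(buildPrefixFirstPipeline("Car")). Note that the unanchored, case-insensitive $match still has to examine every itemName, so this fixes the ordering but not the regex performance concern from the question.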

    Edit:

    As for the behavior of $text that you describe, this is completely normal.

    A text index tokenizes text using delimiters (by default, whitespace and punctuation). It can only be used to search for whole words, so it won't match a part of a word such as "Zwie".

    From the MongoDB $text documentation:

    $text will tokenize the search string using whitespace and most punctuation as delimiters, and perform a logical OR of all such tokens in the search string.
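
The effect can be illustrated with a deliberately simplified model in plain JavaScript. This is not MongoDB's actual tokenizer (the real index also applies stemming and stop words, which are ignored here), and the item name "Zwiebel rot" is made up for the example.

```javascript
// Simplified model: split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// A document matches if ANY search token equals one of its tokens
// (the logical OR described in the quote above). Partial tokens never match.
function matchesAnyToken(docText, search) {
  const docTokens = new Set(tokenize(docText));
  return tokenize(search).some((t) => docTokens.has(t));
}

console.log(matchesAnyToken("Zwiebel rot", "Zwiebel")); // whole word
console.log(matchesAnyToken("Zwiebel rot", "Zwie"));    // only part of a word
```

Under this model "Zwiebel" equals a stored token while "Zwie" does not, which is why the second query in the question comes back empty.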