Search code examples
rregexmongodbmongolite

word boundary in regex with mongolite


I am facing a problem using word boundary regex with mongolite. It looks like the word boundary \b does not work, whereas it works in norm MongoDB queries.

Here is a working example:

I create this toy collection:

db.test2.insertMany([
   { item: "journal gouttiere"},
   { item: "notebook goutte"},
   { item: "paper plouf"},
   { item: "planner gouttement"},
   { item: "postcard goutte"}
]);

With mongosh:

db.test2.aggregate(
  {
$match: {
    item: RegExp("\\bgoutte\\b")
  }
 })

returns:

[
  {
    "_id": {
      "$oid": "63206efeb0e1e89db6ef0c20"
    },
    "item": "notebook goutte"
  },
  {
    "_id": {
      "$oid": "63206efeb0e1e89db6ef0c23"
    },
    "item": "postcard goutte"
  }
]

But:

library(mongolite)

connection <- mongo(collection="test2",db="test",
                    url = "mongodb://localhost:27017",
                    verbose = T)

connection$aggregate(pipeline = '[{
      "$match": {
      "item":{"$regex" : "\\bgoutte\\b", "$options" : "i"}
      }
}]',options = '{"allowDiskUse":true}')

returns 0 lines. Changing to

connection$aggregate(pipeline = '[{
      "$match": {
      "item":{"$regex" : "goutte", "$options" : "i"}
      }
}]',options = '{"allowDiskUse":true}')


 Imported 3 records. Simplifying into dataframe...
                       _id               item
1 63206efeb0e1e89db6ef0c20    notebook goutte
2 63206efeb0e1e89db6ef0c22 planner gouttement
3 63206efeb0e1e89db6ef0c23    postcard goutte

It looks like the word boundary regex does not work the same with mongolite. What is the proper solution ?


Solution

  • Ottie is right (and should post an answer!–I'd be fine with deleting mine then):

    Backslashes have special meaning for both R and in the regex. You need two additional backslashes (one per \) to pass \\ from R to mongoDB (where you escape \b by \\b), see e.g. this SO question. I just checked:

    con <- mongo(
     "test", 
     url = "mongodb+srv://readwrite:[email protected]/test"
    )
    
    con$insert('{"item": "notebook goutte" }')
    con$insert('{"item": "postcard goutte" }')
    

    Now

    con$aggregate(pipeline = '[{
          "$match": {
          "item":{"$regex" : "\\\\bgoutte\\\\b", "$options" : "i"}
          }
    }]',options = '{"allowDiskUse":true}')
    

    yields

                           _id            item
    1 63234ac1435f9b7c2a0787c2 notebook goutte
    2 63234ac5435f9b7c2a0787c5 postcard goutte