Search code examples
mongodbmongoosemongodb-queryaggregation-frameworkmongodb-indexes

What is the best way to query an array of subdocument in MongoDB?


let's say I have a collection like so:

    {
    "id": "2902-48239-42389-83294",
    "data": {
        "location": [
            {
                "country": "Italy",
                "city": "Rome"
            }
        ],
        "time": [
            {
                "timestamp": "1626298659",
                "data":"2020-12-24 09:42:30"
            }
        ],
        "details": [
            {
                "timestamp": "1626298659",
                "data": {
                    "url": "https://example.com",
                    "name": "John Doe",
                    "email": "[email protected]"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "https://www.myexample.com",
                    "name": "John Doe",
                    "email": "[email protected]"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "http://example.com/sub/directory",
                    "name": "John Doe",
                    "email": "[email protected]"    
                }
            }
        ]
    }
}

Now the main focus is on the array of subdocument("data.details"): I want to get output only of relevant matches e.g:

db.info.find({"data.details.data.url": "example.com"})
  1. How can I get a match for all "data.details.data.url" contains "example.com" but won't match with "myexample.com". When I do it with $regex I get too many results, so if I query for "example.com" it also return "myexample.com"

  2. Even when I do get partial results (with $match), It's very slow. I tried this aggregation stages:

       { $unwind: "$data.details" },
    
       {
         $match: {
           "data.details.data.url": /.*example.com.*/,
         },
       },
       {
         $project: {
           id: 1,
           "data.details.data.url": 1,
           "data.details.data.email": 1,
         },
       },
    
  3. I really don't understand the pattern, with $match, sometimes Mongo do recognize prefixes like "https://" or "https://www." and sometime it does not.

More info: My collection has dozens of GB, I created two indexes:

  • Compound like so: "data.details.data.url": 1, "data.details.data.email": 1
  • Text Index: "data.details.data.url": "text", "data.details.data.email": "text"

It did improve the query performance but not enough and I still have this issue with the $match vs $regex. Thanks for helpers!


Solution

  • Your mistake is in the regex. It matches all URLs because the substring example.com is in all URLs. For example: https://www.myexample.com matches the bolded part.

    To avoid this you have to use another regex, for example that just start with that domain.

    For example:

    (http[s]?:\/\/|www\.)YOUR_SEARCH
    

    will check that what you are searching for is behind an http:// or www. marks. https://regex101.com/r/M4OLw1/1

    I leave you the full query.

    [
      {
        '$unwind': {
          'path': '$data.details'
        }
      }, {
        '$match': {
          'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/)
        }
      }
    ]
    

    Note: you must scape special characters from the regex. A dot matches any character and the slash will close your regex causing an error.