mongodb mongoose mongodb-query aggregation-framework mongodb-indexes

What is the best way to query an array of subdocuments in MongoDB?

Let's say I have a collection with documents like this:

    {
        "id": "2902-48239-42389-83294",
        "data": {
            "location": [
                {
                    "country": "Italy",
                    "city": "Rome"
                }
            ],
            "time": [
                {
                    "timestamp": "1626298659",
                    "data": "2020-12-24 09:42:30"
                }
            ],
            "details": [
                {
                    "timestamp": "1626298659",
                    "data": {
                        "url": "https://example.com",
                        "name": "John Doe",
                        "email": "[email protected]"
                    }
                },
                {
                    "timestamp": "1626298652",
                    "data": {
                        "url": "https://www.myexample.com",
                        "name": "John Doe",
                        "email": "[email protected]"
                    }
                },
                {
                    "timestamp": "1626298652",
                    "data": {
                        "url": "http://example.com/sub/directory",
                        "name": "John Doe",
                        "email": "[email protected]"
                    }
                }
            ]
        }
    }

Now the main focus is on the array of subdocuments ("data.details"). I want the output to contain only the relevant matches, e.g.:

db.info.find({"data.details.data.url": "example.com"})

How can I match every "data.details.data.url" that contains "example.com" without also matching "myexample.com"? When I use $regex I get too many results: a query for "example.com" also returns "myexample.com".

Even when I do get partial results (with $match), it is very slow. I tried these aggregation stages:

   { $unwind: "$data.details" },

   {
     $match: {
       "data.details.data.url": /.*example.com.*/,
     },
   },
   {
     $project: {
       id: 1,
       "data.details.data.url": 1,
       "data.details.data.email": 1,
     },
   },

I really don't understand the pattern: with $match, sometimes Mongo does recognize prefixes like "https://" or "https://www." and sometimes it does not.

More info: my collection is dozens of GB in size, and I created two indexes:

Compound like so: "data.details.data.url": 1, "data.details.data.email": 1
Text Index: "data.details.data.url": "text", "data.details.data.email": "text"

It did improve the query performance, but not enough, and I still have this issue with $match vs $regex. Thanks for any help!

Solution

Your mistake is in the regex. It matches all URLs because the substring example.com is present in all of them. For example, https://www.myexample.com matches on the example.com part of myexample.com.

To avoid this you have to use a different regex, for example one that only matches when the domain comes directly after the scheme or the www. prefix.

For example:

(http[s]?:\/\/|www\.)YOUR_SEARCH

This checks that what you are searching for comes directly after http://, https:// or www. You can try it at https://regex101.com/r/M4OLw1/1
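The difference between the two patterns can be checked outside MongoDB. The following is a minimal sketch in plain JavaScript, using the three URLs from the question's sample document; the variable names are only for illustration.

```javascript
// URLs taken from the sample document in the question.
const urls = [
  "https://example.com",
  "https://www.myexample.com",
  "http://example.com/sub/directory",
];

// Pattern from the question: unanchored, so "example.com" is found
// anywhere in the string, including inside "myexample.com".
const loose = /.*example.com.*/;

// Pattern from this answer: the domain must come right after
// "http://", "https://" or "www.", and the dot is escaped.
const strict = /(http[s]?:\/\/|www\.)example\.com/;

const looseMatches = urls.filter((url) => loose.test(url));
const strictMatches = urls.filter((url) => strict.test(url));

console.log(looseMatches.length); // 3
console.log(strictMatches);
// [ 'https://example.com', 'http://example.com/sub/directory' ]
```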

Here is the full query:

[
  {
    '$unwind': {
      'path': '$data.details'
    }
  }, {
    '$match': {
      'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
    }
  }
]
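Since the question also mentions speed, one possible variation is to filter whole documents before unwinding, so that $unwind only expands documents that contain at least one matching element. This is a sketch under that assumption, not a measured improvement; the $project stage is copied from the question and the variable names are illustrative.

```javascript
// Same pattern as in the query above.
const urlPattern = /(http[s]?:\/\/|www\.)example\.com/;

const pipeline = [
  // Keep only documents where at least one array element matches.
  { $match: { "data.details.data.url": urlPattern } },
  // Expand the array into one document per element.
  { $unwind: { path: "$data.details" } },
  // After unwinding, drop the elements that do not match.
  { $match: { "data.details.data.url": urlPattern } },
  // Projection taken from the question.
  {
    $project: {
      id: 1,
      "data.details.data.url": 1,
      "data.details.data.email": 1,
    },
  },
];

// In the mongo shell this would be run as: db.info.aggregate(pipeline)
console.log(JSON.stringify(pipeline.map((stage) => Object.keys(stage)[0])));
```

Whether this helps in practice depends on how many documents the first $match can discard, so it is worth comparing both versions with explain().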

Note: you must escape special characters in the regex. An unescaped dot matches any character, and an unescaped slash closes the regex literal, causing an error.
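If the search term comes from user input, the escaping can be done in code instead of by hand. The helper names below (escapeRegex, buildDomainRegex) are hypothetical, and the sketch assumes the pattern is built with the RegExp constructor, where a slash does not need escaping because there is no regex literal to close.

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the pattern from this answer for an arbitrary search term.
function buildDomainRegex(domain) {
  return new RegExp("(http[s]?://|www\\.)" + escapeRegex(domain));
}

const pattern = buildDomainRegex("example.com");

console.log(escapeRegex("example.com")); // example\.com
console.log(pattern.test("https://example.com")); // true
console.log(pattern.test("https://www.myexample.com")); // false
```

The resulting RegExp object can then be used as the value for 'data.details.data.url' in the $match stage.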