Search code examples
databasemongodbmongodb-query

slow Mongodb regex for 200 million documents


I have a MongoDB collection with around 200 million entries for different websites. The documents are structured like this:

{
  "_id": {
    "$oid": "6647314adcfdad8dd8b1c84f"
  },
  "url": "http://example.com/login.php",
  "name": "Jack"
}

I have both a regular index and a text index on the url field. When I try to search for subdomains for a specific domain, the query almost never completes:

{
  "url": {
    $regex: "^http://([a-zA-Z0-9-]+\\.)*google\\.com"
  }
}

The explain feature indicates that my index is being used. I also tried simplifying the regex to search for a simple http or https pattern, but that also failed to improve performance.

The only search that is fast is an exact domain search:

{
  "url": {
    $regex: "^https://example.com"
  }
}

I'm running it on an intel i5-12450H and 32GBs of ram.

What am I doing wrong? Or is MongoDB not designed for this kind of task?


Solution

  • Solved: Just create a separate field for the domain. I extract it using python's tldextract. It's much faster this way.