I have a MongoDB collection with around 200 million documents for different websites. The documents are structured like this:
{
  "_id": {
    "$oid": "6647314adcfdad8dd8b1c84f"
  },
  "url": "http://example.com/login.php",
  "name": "Jack"
}
I have both a regular index and a text index on the url field. When I try to search for the subdomains of a specific domain, the query almost never completes:
{
  "url": {
    $regex: "^http://([a-zA-Z0-9-]+\\.)*google\\.com"
  }
}
The explain output indicates that my index is being used. I also tried simplifying the regex to a plain http or https prefix, but that did not improve performance either.
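Here is roughly how I'm checking the plan (a minimal pymongo sketch; the connection URI and the database/collection names are placeholders for my setup):

from pymongo import MongoClient

# Placeholders: adjust the URI, database, and collection names.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["sites"]

plan = coll.find(
    {"url": {"$regex": r"^http://([a-zA-Z0-9-]+\.)*google\.com"}}
).explain()

# The winning plan reports an IXSCAN, but the only constant prefix in the
# regex is "http://", so the index bounds span essentially every http URL.
print(plan["queryPlanner"]["winningPlan"])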
The only query that is fast is an anchored search on the exact domain, where the constant prefix lets the index bound the scan:
{
  "url": {
    $regex: "^https://example.com"
  }
}
I'm running this on an Intel i5-12450H with 32 GB of RAM.
What am I doing wrong? Or is MongoDB not designed for this kind of task?
Solved: Just create a separate field for the domain. I extract it with Python's tldextract and index that field; queries are much faster this way.
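For anyone who finds this later, the backfill looks roughly like this (a sketch rather than my exact script; the connection URI, database/collection names, and batch size are placeholders):

from pymongo import MongoClient, UpdateOne, ASCENDING
import tldextract

# Placeholders: adjust the URI and names for your deployment.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["sites"]

# Backfill a "domain" field for documents that don't have one yet.
batch = []
for doc in coll.find({"domain": {"$exists": False}}, {"url": 1}):
    ext = tldextract.extract(doc["url"])
    # registered_domain is e.g. "google.com" for "http://mail.google.com/x"
    batch.append(UpdateOne({"_id": doc["_id"]},
                           {"$set": {"domain": ext.registered_domain}}))
    if len(batch) >= 1000:
        coll.bulk_write(batch, ordered=False)
        batch = []
if batch:
    coll.bulk_write(batch, ordered=False)

coll.create_index([("domain", ASCENDING)])

# The subdomain search becomes an indexed equality match instead of a regex:
results = coll.find({"domain": "google.com"})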