I have a MongoDB collection with more than 20 million documents (and growing fast). Some documents have a 'user_id' field (others don't).
I regularly need to check whether some user_ids exist in the collection. But there is a lot of 'some': 10K to 100K ids at a time.
How would you do that?
My first thought was one big query:
$or : [ { 'user_id' : ... } , { ... } , ... ]
With all my 10K to 100K ids... But it's very slow, and I don't think it's the best way to do it. Someone told me about Redis's Bloom filters, but it seems that Mongo does not offer anything like that.
Do you see a better way to do it? As with Bloom filters, less than 100% accuracy is acceptable in my case.
You can try using $in for set comparison:
db.users.find({ _id : { $in : [ 1, 2, 3, 4 ] } })
If you have an index on the field you are searching over, the operation should be fast. If you don't have an index and you anticipate repeating this query often, you should definitely build one. As mentioned in the comments, a sparse index would suit your collection, since the user_id field is only present in some documents.
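For instance, with pymongo this is a single call (a minimal sketch, reusing the anonymised test database from the benchmark below):

from pymongo import MongoClient

db = MongoClient()["anonymised"]
# Sparse index: only documents that actually contain a user_id field
# get an index entry, so the index stays compact.
db.users.create_index("user_id", sparse=True)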
I ran a benchmark in IPython on a collection of ~200M documents, in a test database on a relatively high-spec laptop:
import random
from pymongo import MongoClient
client = MongoClient()
db = client["anonymised"]
# generate 100K random ids for testing purposes
ids = [ random.randint(0, int(1e6)) for i in range(100000) ]
%time db.users.count({ "_id" : { "$in" : ids } })
CPU times: user 182 ms, sys: 75.5 ms, total: 257 ms
Wall time: 16.1 s
# returns 32631
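If you need the matching ids themselves rather than just a count, projecting on _id keeps the result lightweight. A sketch, reusing db and ids from the benchmark above:

# Fetch only the _id field of matching documents and collect
# the ids that actually exist into a set.
existing = { doc["_id"] for doc in db.users.find({ "_id" : { "$in" : ids } }, { "_id" : 1 }) }
missing = set(ids) - existing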
If you want to improve on this, you would have to look at sharding your database so that a larger fraction of it stays in active memory. In a production environment with the entire working set in memory, this operation would presumably be significantly faster.
By contrast, the $or approach you initially took:
query = [ { "_id" : v } for v in ids ]
%time db.users.count({ "$or" : query })
CPU times: user 1.4 s, sys: 682 ms, total: 2.08 s
Wall time: 35min 30s
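As for the Bloom filter idea from your question: MongoDB has no built-in Bloom filters, but since less than 100% accuracy is acceptable for you, you could maintain one client-side over the set of user_ids already in the collection. A hand-rolled sketch (the bit-array size, hash count, and double-hashing scheme below are my own illustrative choices, not anything MongoDB provides):

import hashlib

class BloomFilter:
    """Simple Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two digests (double hashing).
        data = str(item).encode()
        h1 = int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest()[:8], "big")
        return [ (h1 + i * h2) % self.num_bits for i in range(self.num_hashes) ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Populate once from the collection (or incrementally on every insert).
bf = BloomFilter()
for doc in db.users.find({ "user_id" : { "$exists" : True } }, { "user_id" : 1 }):
    bf.add(doc["user_id"])

# Membership tests are now pure in-memory operations: a True answer may
# occasionally be a false positive, but a False answer is always correct.
maybe_present = [ uid for uid in ids if uid in bf ]

If you ever need exact answers, you can then confirm the (much smaller) maybe_present list against MongoDB with a single $in query.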