Search code examples
mongodbmongodb-querypymongo

MongoDB - Best way to delete documents by query based on results of another query


I have a collection that can contain several million documents, for simplicity, lets say they look like this:

{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}

I want to have a daily job that runs and deletes all documents by user_id, if the user_id has a doc with event_type 'c'

So the resulting collection will be

{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}

I did it successfully with mongoshell like this

var cur = db.my_collection.find({'event_type': 'c'})
ids = [];
while (cur.hasNext()) {
  ids.push(cur.next()['user_id']);
  if (ids.length == 5){
    print('deleting for user_ids', ids);
    print(db.my_collection.deleteMany({user_id: {$in: ids}}));
    ids = [];
  }
}
if (ids.length){db.my_collection.deleteMany({user_id: {$in: ids}})}

Created a cursor to hold all docs with event_type 'c', grouped them into batches of 5 then deleted all docs with these ids.

It works but looks very slow, like each cur.next() only gets one doc at a time.

I wanted to know if there is a better or more correct way to achieve this, if it was elasticsearch I would create a sliced scroll, scan each slice in parallel and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with mongo?

Scale wise I expect there to be several million docs (~10M) at the collection, 300K docs that match the query, and ~700K that should be deleted


Solution

  • It sounds like you can just use deleteMany with the original query:

    db.my_collection.deleteMany({
        event_type: 'c'
    })
    

    No size limitations on it, it might just take a couple of minutes to run depending on instance size.


    EDIT:

    I would personally try to use the distinct function, it's the cleanest and easiest code. distinct does have a 16mb limit about 300k~ unique ids a day (depending on userid field size) sounds a bit close to the threshold, or past it.

    const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
    db.my_collection.deleteMany({user_id: {$in: userIds}})
    

    Assuming you except scale to increase, or this fails your tests then the best way is to use something similar to your approach, just in much larger batches. for example:

    const batchSize = 100000;
    const count = await db.my_collection.countDocuments({'event_type': 'c'});
    let iteration = 0;
    
    while (iteration * batchSize < count) {
        const batch = await db.my_collection.find({'event_type': 'c'}, { projection: { user_id: 1}}).limit(batchSize).toArray();
        if (batch.length === 0) {
            break
        }
        await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
        iteration++
    }