Tags: mongodb, embedded-documents

Optimized way to update large embedded array in mongo document with an indexed key


I have a user collection with 52 million records. Each user document has a list of comments, and comment_id has a unique index on it.

{
  _id:123, 
  user_name:"xyz",
  comments:[
    {
      comment_id:123,
      text:"sd"
    },
    {
      comment_id:234,
      text:"sdf"
    }
    ......,
    (63000 elements)
  ]
}

The totalIndexSize of the comment_id index is 104 GB. Around 100 of the 52M documents have 63,000 elements in their comments array.

My aim is to delete the old comments and reduce the size of the comments array by over 80%. Earlier, I tried to update the document using this query:

db.user.updateOne({ _id: _id }, { $set: { comments: newCommentsArray } }, { upsert: true })

Here newCommentsArray has around 400 elements. This operation took around 130 seconds to execute.

My Questions are:

1) What could be the reason the update query above took 130 seconds? Is it because of the huge unique index on the comment_id field? (I believe that replacing the comments array with a new array has to remove all 63,000 deleted elements from the index and insert the new elements into it.)

2) I had another approach: use $pull to remove 100 comments from the comments array, wait 5 seconds, and then run the next batch of 100 comments. What do you think of this solution?

3) If the above solution is no good, can you suggest a good way to reduce the comments array by over 80%?


Solution

  • The comment_id index is huge because it is a multikey index:

    MongoDB creates an index key for each element in the array.

    In your case, the _id index has one key per document (~1 GB for 52M documents), while the comment_id index is multikey with an average of ~100 keys per document, so it ends up roughly 100× larger — which matches the ~104 GB you see.
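
    A quick way to confirm where the space goes is to look at the per-index sizes reported in the collection stats (not part of the original question, just a sanity check):

    db.user.stats().indexSizes     // size in bytes of each index; the multikey comment_id index should dominate
    db.user.totalIndexSize()       // total size of all indexes on the collection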

    1) What could be the reason the update query above took 130 seconds?

    MongoDB stores indexes in a B-tree structure. B-tree properties:

    Algorithm   Average     Worst case
    Space       O(n)        O(n)
    Search      O(log n)    O(log n)
    Insert      O(log n)    O(log n)
    Delete      O(log n)    O(log n)
    

    This means that to insert the index keys for the comments, MongoDB needs O(log n) comparisons per key (~25 iterations per item) in the worst case.
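
    If you want to see where the 130 seconds actually go, you can run the same update through explain, which reports execution stats without modifying the document (a sketch; _id and newCommentsArray are the placeholders from the question):

    db.user.explain("executionStats").update(
      { _id: _id },
      { $set: { comments: newCommentsArray } }
    )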

    2) I had another approach: use $pull to remove 100 comments from the comments array, wait 5 seconds, and then run the next batch of 100 comments.

    As the comments are indexed, each $pull will be fast (remember the O(log n) property). It is not necessary to wait 5 seconds between batches: since MongoDB 3.0, the WiredTiger storage engine uses multi-granularity locking, which locks only the affected documents.
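
    A minimal sketch of that batched $pull approach, assuming you already know which comment_id values to remove (oldCommentIds and userId are hypothetical names, not from the question):

    var oldCommentIds = [ /* comment_ids to delete */ ];
    var batchSize = 100;
    for (var i = 0; i < oldCommentIds.length; i += batchSize) {
      var batch = oldCommentIds.slice(i, i + batchSize);
      // Each batch touches only this one document, so no pause is needed between batches.
      db.user.updateOne(
        { _id: userId },
        { $pull: { comments: { comment_id: { $in: batch } } } }
      );
    }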

    Also, you can trim the array with the $push operator like this:

    db.user.update({ }, { $push: { comments: { $each: [], $slice: -400 } } })
    

    This pushes the elements in $each (here, none) and then keeps only the last 400 elements of the array.
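
    For the ~100 oversized documents specifically, a hedged sketch that finds arrays with at least 1000 elements (via the "comments.999" existence check, which scans the collection) and keeps only the newest 400 comments in each, assuming comments are ordered oldest-to-newest:

    db.user.find({ "comments.999": { $exists: true } }, { _id: 1 }).forEach(function (doc) {
      // $each: [] pushes nothing; $slice: -400 keeps only the last 400 elements.
      db.user.updateOne(
        { _id: doc._id },
        { $push: { comments: { $each: [], $slice: -400 } } }
      );
    });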

    3) If the above solution is no good, can you suggest a good way to reduce the comments array by over 80%?

    Even if you shrink the comments array, WiredTiger will not release the unneeded disk space back to the operating system on its own. To reclaim it, you can either drop the index or run compact.

    Running dropIndex

    db.user.dropIndex({ "comments.comment_id" : 1 })
    

    Warning: Since v4.2, dropIndex obtains an exclusive lock on the specified collection for the duration of the operation. All subsequent operations on the collection must wait until db.collection.dropIndex() releases the lock.

    Prior to v4.2, this command obtains a write lock on the affected database and blocks other operations until it has completed.

    Or running compact

    Warning: compact blocks operations for the database it is currently operating on. Only use compact during scheduled maintenance periods. Also, you must authenticate as a user with the compact privilege action on the target collection.
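
    If you do run compact, it is issued per collection (shown here for the user collection from the question):

    db.runCommand({ compact: "user" })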