java mongodb mongodb-query mongo-java

How to remove duplicates from a MongoDB collection when there is no unique key?


How should I remove duplicates from a MongoDB collection when no field is unique?

I want to do this using the Java driver. In the picture below, some records are identical, and I want to remove the repeated ones. Time is not a unique key here.

(image: sample records shown as a table)

P.S.: I have only presented the data in table form. The records are actually stored as a JSON array.


Solution

  • I agree with other users here who have pointed out that the presence of duplicate documents might indicate some problem with your application, and that eliminating duplicates before they are inserted is better than trying to clean them up later. You should ensure that the duplicates truly are meaningless and try to identify their source, as a higher priority than cleaning them up.

    That said, the meaning of "duplicate" here seems to be "the value of every single field (except _id) is the same". So, to eliminate duplicates, I would do the following:

    1. Iterate over every document in the collection, possibly in parallel using a parallel collection scan.

    2. Compute a hash of all of the non-_id fields.

    3. Insert or update (upsert) a document in another collection, keyed by that hash, representing one set of duplicates:

    {
        "_id" : #hash#,
        "docs" : [#array of _ids of docs#],
        "count" : #number of _ids in docs array#
    }
    

    Then you'll have a record of all duplicates: iterate over this collection and, for each document with count > 1, remove all but one of the documents listed in docs. Alternatively, if you don't want to keep a record of the duplicates, insert a doc containing only the hash as _id. Whenever that insert fails with a duplicate-key error, delete the current document, because it is a duplicate (with high probability, since two different documents could in principle produce the same hash).
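The hash-and-group steps above could be sketched roughly as follows. This is only an illustration under some assumptions, not a definitive implementation. Plain `Map<String, Object>` values stand in for documents so the example runs without a database. Fields are sorted by name before hashing, so key order is ignored. Values are hashed through their string form, which may not be stable for nested documents. The class name `DedupSketch`, its methods, and the sample fields `time` and `value` are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {

    // Step 2: hash every field except _id. Sorting by field name is an
    // assumption of this sketch, so that key order does not affect the hash.
    static String contentHash(Map<String, Object> doc) {
        TreeMap<String, Object> sorted = new TreeMap<>(doc);
        sorted.remove("_id");
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, Object> e : sorted.entrySet()) {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between name and value
                md.update(String.valueOf(e.getValue()).getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between fields
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Steps 1 and 3: map each hash to the _ids of the documents sharing it.
    static Map<String, List<Object>> groupByHash(List<Map<String, Object>> docs) {
        Map<String, List<Object>> groups = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            groups.computeIfAbsent(contentHash(doc), h -> new ArrayList<>())
                  .add(doc.get("_id"));
        }
        return groups;
    }

    // For each group with more than one _id, keep the first and return the rest.
    static List<Object> idsToRemove(Map<String, List<Object>> groups) {
        List<Object> remove = new ArrayList<>();
        for (List<Object> ids : groups.values()) {
            if (ids.size() > 1) {
                remove.addAll(ids.subList(1, ids.size()));
            }
        }
        return remove;
    }

    // Hypothetical sample document; the field names are invented.
    static Map<String, Object> doc(int id, String time, int value) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("_id", id);
        d.put("time", time);
        d.put("value", value);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc(1, "10:00", 5));
        docs.add(doc(2, "10:00", 5)); // same content as _id 1
        docs.add(doc(3, "10:05", 7));
        System.out.println(idsToRemove(groupByHash(docs))); // prints [2]
    }
}
```

With the Java driver, `org.bson.Document` implements `Map<String, Object>`, so documents read from the collection could probably be fed through the same logic after minor type adjustments. The resulting ids could then be passed to a delete such as `deleteMany(Filters.in("_id", ids))`. For a large collection, the groups would live in a second collection, as described above, rather than in memory.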