Search code examples
pythonmongodbpymongo-3.x

how to find and remove duplicates documents from mongodb collection using python


how to find and remove duplicates documents from mongodb using python. We have total 7 documents/records out of 2 documents/records are duplicate. so need to find those duplicate document/records and delete from same collection. In the documents we will have 100 attributes so we can not find documents based on few attributes. We need exact duplicate document/records

MongoDB collection

[
    { 'name': 'Amy', 'address': 'Apple st 652', 'age': 34 },
    { 'name': 'Hannah', 'address': 'Mountain 21', 'age': 34 },
    { 'name': 'Hannah', 'address': 'Mountain 21', 'age': 34 },
    { 'name': 'Amy', 'address': 'Apple st 652', 'age': 34 },
    { 'name': 'Richard', 'address': 'Sky st 331', 'age': 34 },
    { 'name': 'Chuck', 'address': 'Main Road 989', 'age': 34 },
    { 'name': 'Viola', 'address': 'Sideway 1633', 'age': 34 },
];

Output Collection

[
    { 'name': 'Amy', 'address': 'Apple st 652' },
    { 'name': 'Hannah', 'address': 'Mountain 21' },
    { 'name': 'Richard', 'address': 'Sky st 331' },
    { 'name': 'Chuck', 'address': 'Main Road 989' },
    { 'name': 'Viola', 'address': 'Sideway 1633' },
];

Solution

  • You can use the $$ROOT which represents the full document and group on that, like so:

    db.collection.aggregate([
      {
        $project: {
          _id: 0,
          
        }
      },
      {
        $group: {
          _id: "$$ROOT"
        }
      },
      {
        $replaceRoot: {
          newRoot: "$_id"
        }
      }
    ])
    

    As you can see we have to remove the _id field as it's unique and would ruin the approach, an additional point to consider is that this loads the entire collection into memory. If you machine can't handle it you have no other choice but to incorporate code and paginate over the collection.