
MongoDB: how to delete documents with duplicate unique_index values from a collection


I have a collection which has redundant data.

Example Data:

{
    unique_index : "1",
    other_field : "whatever1"
},
{
    unique_index : "2",
    other_field : "whatever2"
},
{
    unique_index : "1",
    other_field : "whatever1"
}

I ran this query (I have to use allowDiskUse: true because there is a lot of data):

db.collection.aggregate([
    {
        $group: { 
            _id: "$unique_index", 
            count: { $sum: 1 }
        } 
    }, 
    { $match: { count: { $gte: 2 } } }
], { allowDiskUse: true })

I get output like this (for example):

{ "_id" : "1", "count" : 2 }
.
.

Now the problem is that I want to keep only one document per unique_index and delete all the redundant ones. Please note that there is a lot of data, more than 100,000 records. I am looking for a fast and easy solution (in MongoDB or in Ruby on Rails, since that is what I am using); any help would be appreciated.


Solution

  • If you don't care about _id, the simplest way is to select the distinct documents into a new collection and then rename it over the original:

    db.collection.aggregate([
        {$group: {
            _id: "$unique_index", 
            other_field: {$first: "$other_field"}
        }},
        {$project: {
            _id: 0,
            unique_index: "$_id",
            other_field:1
        }},
        {$out: "new_collection"}
    ]);
    
    db.new_collection.renameCollection("collection", true);
    

    Please bear in mind that you will need to recreate all indexes afterwards, since $out writes a brand-new collection. Also, renameCollection does not work on sharded collections.
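
    The keep-first semantics of the `$group`/`$first` stage above can be sketched in plain JavaScript, for illustration only (the helper name `dedupeByUniqueIndex` is made up for this sketch; the real work happens server-side in the pipeline):

    ```javascript
    // Keep the first document seen for each unique_index value,
    // mirroring what {$group: {_id: "$unique_index", other_field: {$first: ...}}} does.
    function dedupeByUniqueIndex(docs) {
        const seen = new Map();
        for (const doc of docs) {
            // A Map preserves insertion order, so the first occurrence wins.
            if (!seen.has(doc.unique_index)) {
                seen.set(doc.unique_index, doc);
            }
        }
        return [...seen.values()];
    }

    const docs = [
        { unique_index: "1", other_field: "whatever1" },
        { unique_index: "2", other_field: "whatever2" },
        { unique_index: "1", other_field: "whatever1" },
    ];
    const deduped = dedupeByUniqueIndex(docs);
    ```

    Once the collection is rebuilt, one way to keep duplicates from reappearing is a unique index, e.g. `db.collection.createIndex({ unique_index: 1 }, { unique: true })`.
    
    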