I have a collection which has redundant data.
Example Data:
{
unique_index : "1"
other_field : "whatever1"
},
{
unique_index : "2"
other_field : "whatever2"
},
{
unique_index : "1"
other_field : "whatever1"
}
I ran the query: (I have to use allowDiskUse:true
because there is lot of data)
db.collection.aggregate([
{
$group: {
_id: "$unique_index",
count: { $sum: 1 }
}
},
{ $match: { count: { $gte: 2 } } }
], { allowDiskUse: true })
I get this output: (for example)
{ "_id" : "1", "count" : 2 }
.
.
Now the problem is that I want to keep only one data. I want to delete all redundant data. Please note that its lot of data, like more than 100,000 records or something. I am searching for fast and easy solution (in mongodb or RoR because I am using Ruby on Rails), if any one can help, would be appreciated.
If you don't care about _id
, the simplest way is to select distinct documents into new collection, and then rename it:
db.collection.aggregate([
{$group: {
_id: "$unique_index",
other_field: {$first: "$other_field"}
}},
{$project: {
_id: 0,
unique_index: "$_id",
other_field:1
}},
{$out: "new_collection"}
]);
db.new_collection.renameCollection("collection", true);
Please bear in mind, you will need to restore all indexes. Also renameCollection
is not working on sharded colelctions.