How can I remove duplicate records from mongoDB projection ? Lets say I have My mongo documents in following form -
{"_id":"55555454", "From":"Bob", "To":"Alice", "subject":"Hi", "date":"04102011"}
{"_id":"55555455", "From":"Bob", "To":"Dave", "subject":"Hello", "date":"04102014"}
{"_id":"55555456", "From":"Bob", "To":"Alice", "subject":"Bye", "date":"04112013"}
When I do a simple projection
db.col.find({}, {"From":1, "To":1, "_id"=0})
which will obviously give me all three records like this.
{"From":"Bob", "To":"Alice"} {"From":"Bob","To":"Dave"} {"From":"Bob", "To":"Alice"}
However What I want is only two records, this way -
{"From":"Bob", "To":"Alice"} {"From":"Bob","To":"Dave"}
As My application is in python currently (using pymongo), what I am doing is that I am removing duplicate in the application from the list of records using
result = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in l)]
Is there any DB method which I can apply to the projection and gives me only two records.
You can't do a reduction and eliminate duplicate documents using just find
with MongoDB and a projection.
The find
commands won't work as you need remember that it's returning a cursor to the client and as such, can't reduce the results to only those documents that are unique without a secondary pass.
Using this as test data (removed the _id
):
> db.test.find()
{ "From" : "Bob", "To" : "Alice", "subject" : "Hi", "date" : "04102011" }
{ "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
{ "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
{ "From" : "Bob", "To" : "Alice", "subject" : "Hi", "date" : "04102011" }
{ "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
{ "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
{ "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
{ "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
{ "From" : "George", "To" : "Carl", "subject" : "Bye", "date" : "04112013" }
{ "From" : "David", "To" : "Carl", "subject" : "Bye", "date" : "04112013" }
You could use aggregation:
> db.test.aggregate({ $group: { _id: { "From": "$From", "To": "$To" }}})
Results:
{
"result" : [
{
"_id" : {
"From" : "David",
"To" : "Carl"
}
},
{
"_id" : {
"From" : "George",
"To" : "Carl"
}
},
{
"_id" : {
"From" : "Bob",
"To" : "Dave"
}
},
{
"_id" : {
"From" : "Bob",
"To" : "Alice"
}
}
],
"ok" : 1
}
The Python code should look very similar to the aggregation pipeline suggested above.