Search code examples
pythonsqlmongodbpymongoprojection

MongoDB: Remove duplicate records from Projection


How can I remove duplicate records from mongoDB projection ? Lets say I have My mongo documents in following form -

{"_id":"55555454", "From":"Bob", "To":"Alice", "subject":"Hi", "date":"04102011"} 
{"_id":"55555455", "From":"Bob", "To":"Dave", "subject":"Hello", "date":"04102014"}
{"_id":"55555456", "From":"Bob", "To":"Alice", "subject":"Bye", "date":"04112013"}

When I do a simple projection db.col.find({}, {"From":1, "To":1, "_id"=0})

which will obviously give me all three records like this.

{"From":"Bob", "To":"Alice"} {"From":"Bob","To":"Dave"} {"From":"Bob", "To":"Alice"}

However What I want is only two records, this way -

{"From":"Bob", "To":"Alice"} {"From":"Bob","To":"Dave"}

As My application is in python currently (using pymongo), what I am doing is that I am removing duplicate in the application from the list of records using

result = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in l)]

Is there any DB method which I can apply to the projection and gives me only two records.


Solution

  • You can't do a reduction and eliminate duplicate documents using just find with MongoDB and a projection.

    The find commands won't work as you need remember that it's returning a cursor to the client and as such, can't reduce the results to only those documents that are unique without a secondary pass.

    Using this as test data (removed the _id):

    > db.test.find()
    { "From" : "Bob", "To" : "Alice", "subject" : "Hi", "date" : "04102011" }
    { "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
    { "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
    { "From" : "Bob", "To" : "Alice", "subject" : "Hi", "date" : "04102011" }
    { "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
    { "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
    { "From" : "Bob", "To" : "Dave", "subject" : "Hello", "date" : "04102014" }
    { "From" : "Bob", "To" : "Alice", "subject" : "Bye", "date" : "04112013" }
    { "From" : "George", "To" : "Carl", "subject" : "Bye", "date" : "04112013" }
    { "From" : "David", "To" : "Carl", "subject" : "Bye", "date" : "04112013" }
    

    You could use aggregation:

    > db.test.aggregate({ $group: { _id: { "From": "$From", "To": "$To" }}})
    

    Results:

    {
        "result" : [
            {
                "_id" : {
                        "From" : "David",
                        "To" : "Carl"
                }
            },
            {
                "_id" : {
                        "From" : "George",
                        "To" : "Carl"
                }
            },
            {
                "_id" : {
                        "From" : "Bob",
                        "To" : "Dave"
                }
            },
            {
                "_id" : {
                        "From" : "Bob",
                        "To" : "Alice"
                }
            }
    ],
        "ok" : 1
    }
    

    The Python code should look very similar to the aggregation pipeline suggested above.