Search code examples
mongodbmongoosemongodb-queryaggregation-frameworkmongodb-aggregation

Deduping results returned by a Mongo aggregate query


Some background:

This involves 3 collections:

  1. posts
  2. postsubcategories
  3. postsupercategories


Example of a document in posts:

{
    "_id" : ObjectId("57fbf3ce7ccbc906ed87cef6"),
    "__v" : 6,
    "author" : ObjectId("57fbe2ac3cfb9e061df86ebb"),
    "postSubCategories" : [ 
        ObjectId("5806344baa0bbf284a2316e4")//reference to document in postsubcategories collection
    ],
    "postSuperCategories" : [ 
        ObjectId("580679958a5f5f448ba5aae9"), 
        ObjectId("580679958a5f5f448ba5aaf2")//references to documents in postsupercategories collection
    ],
    "publishedDate" : ISODate("2016-10-10T04:00:00.000Z"),
    "state" : "published",
    "templateName" : ObjectId("57fbf3977ccbc906ed87cef3"),
    "title" : "My title",
    "topics" : []}

My query is

db.posts.aggregate([
{'$unwind': 
    {'path':"$postSubCategories"}
},
{'$lookup': {
  'from':"postsubcategories",
  'localField': "postSubCategories",
  'foreignField': "_id",
  'as': "subObject"
}},
{'$unwind': 
    {'path':"$postSuperCategories"}
},
{'$lookup': {
  'from':"postsupercategories",
  'localField': "postSuperCategories",
  'foreignField': "_id",
  'as': "superObject"
}},
{'$match': {
    '$or':
        [{ "subObject.searchKeywords": "home monitor" }, 
        { "superObject.searchKeywords": "home monitor" }]
    }
},
{'$match': {
    "state": "published"
}}


Both postsubcategories and postsupercategories collections contain a field called searchKeywords that is an array of text within their documents. I want to be able to query those searchKeywords fields and return the posts documents that match. I need a deduped set of returned _ids.

The query is returning four documents. Example:

ObjectId("57fbf3ce7ccbc906ed87cef6")
ObjectId("57fbf3ce7ccbc906ed87cef6")
ObjectId("57fbf40b7ccbc906ed87cef7")
ObjectId("57fbf40b7ccbc906ed87cef7") 


I understand why it is returning 4. One document contains postSubCategories object 5806344baa0bbf284a2316e4 and postSuperCategories id 580679958a5f5f448ba5aae9.

The second document contains postSubCategories object 5806344baa0bbf284a2316e4 and postSuperCategories 580679958a5f5f448ba5aaf2. This repeats for the second post

Is there a way that I can "dedupe" based on the _ids that are returned?

My end result would be:

ObjectId("57fbf3ce7ccbc906ed87cef6")
ObjectId("57fbf40b7ccbc906ed87cef7")

I know that technically the 2 matching _ids from the list of 4 are not the exact same because they each contain a different postSuperCategories object, but at this point I don't care about that field anymore and just need a singular posts document because I need access to the other fields.

Any help would be greatly appreciated. I have tried looking into $group, $addToSet, and $setUnion and have been unsuccessful so far.


Solution

  • You can add one $group retrieving distinct _id with the first value found for each of the properties you want to extract per _id.

    for the $group aggregation :

    {
        '$group': {
            _id: '$_id',
            item: { $first: "$$ROOT" } 
        }
    }
    

    which will give you the first item for root document in item field :

    { "_id" : ObjectId("57fbf40b7ccbc906ed87cef7"), "items" : { "_id" : ObjectId("57fbf40b7ccbc906ed87cef7"), "__v" : 6, "author" : ObjectId("57fbe2ac3cfb9e061df86ebb"), "postSubCategories" : ObjectId("5806344baa0bbf284a2316e4"), "postSuperCategories" : ObjectId("580679958a5f5f448ba5aae9"), "publishedDate" : ISODate("2016-12-10T04:00:00Z"), "state" : "published", "templateName" : ObjectId("57fbf3977ccbc906ed87cef4"), "title" : "My title2", "topics" : [ "a", "b" ], "subObject" : [ { "_id" : ObjectId("5806344baa0bbf284a2316e4"), "searchKeywords" : "home monitor" } ], "superObject" : [ { "_id" : ObjectId("580679958a5f5f448ba5aae9"), "searchKeywords" : "home monitor2" } ] } }
    { "_id" : ObjectId("57fbf3ce7ccbc906ed87cef6"), "items" : { "_id" : ObjectId("57fbf3ce7ccbc906ed87cef6"), "__v" : 6, "author" : ObjectId("57fbe2ac3cfb9e061df86ebb"), "postSubCategories" : ObjectId("5806344baa0bbf284a2316e4"), "postSuperCategories" : ObjectId("580679958a5f5f448ba5aae9"), "publishedDate" : ISODate("2016-10-10T04:00:00Z"), "state" : "published", "templateName" : ObjectId("57fbf3977ccbc906ed87cef3"), "title" : "My title", "topics" : [ ], "subObject" : [ { "_id" : ObjectId("5806344baa0bbf284a2316e4"), "searchKeywords" : "home monitor" } ], "superObject" : [ { "_id" : ObjectId("580679958a5f5f448ba5aae9"), "searchKeywords" : "home monitor2" } ] } }
    

    Otherwise, for selecting a field in the response :

    {
        '$group': {
            _id: '$_id',
            author: {
                $first: '$author'
            },
            publishedDate: {
                $first: '$publishedDate'
            },
            state: {
                $first: '$state'
            },
            templateName: {
                $first: '$templateName'
            },
            title: {
                $first: '$title'
            },
            topics: {
                $first: '$topics'
            }
        }
    }
    

    You will get something like :

    { "_id" : ObjectId("57fbf40b7ccbc906ed87cef7"), "author" : ObjectId("57fbe2ac3cfb9e061df86ebb"), "publishedDate" : ISODate("2016-10-10T04:00:00Z"), "state" : "published", "templateName" : ObjectId("57fbf3977ccbc906ed87cef3"), "title" : "My title", "topics" : [ ] }
    { "_id" : ObjectId("57fbf3ce7ccbc906ed87cef6"), "author" : ObjectId("57fbe2ac3cfb9e061df86ebb"), "publishedDate" : ISODate("2016-10-10T04:00:00Z"), "state" : "published", "templateName" : ObjectId("57fbf3977ccbc906ed87cef3"), "title" : "My title", "topics" : [ ] }