Search code examples
pythonmongodbflaskmongoengineflask-mongoengine

Aggregation function for Counting of Duplicates in a field based on duplicate items in another field


I am using mongoengine as ORM with flask application. The model class is define like

class MyData(db.Document):
    task_id = db.StringField(max_length=50, required=True)
    url = db.URLField(max_length=500,required=True,unique=True)
    organization = db.StringField(max_length=250,required=True)
    val = db.StringField(max_length=50, required=True)

The field organization can be repeating and I want to get the count of duplicates with respect to values in another field. For example if the data in mongodb is like

[{"task_id":"as4d2rds5","url":"https:example1.com","organization":"Avengers","val":"null"},
 {"task_id":"rfre43fed","url":"https:example1.com","organization":"Avengers","val":"valid"},
 {"task_id":"uyje3dsxs","url":"https:example2.com","organization":"Metro","val":"valid"},
 {"task_id":"ghs563vt6","url":"https:example1.com","organization":"Avengers","val":"invalid"},
 {"task_id":"erf6egy64","url":"https:example2.com","organization":"Metro","val":"null"}]

Then I am querying all the objects using

data = MyData.objects()

I want a response like

[{"url":"https:example1.com","Avengers":{"valid":1,"null":1,"invalid":1}},{"url":"https:example2.com",Metro":{"valid":1,"null":1,"invalid":0}}]

I tried like

db.collection.aggregate([
  {
    "$group": {
      "_id": "$organization",
      "count": [
        {
          "null": {
            "$sum": 1
          },
          "valid": {
            "$sum": 1
          },
          "invalid": {
            "$sum": 1
          }
        }
      ]
    }
  }
])

but I am getting an error

The field 'count' must be an accumulator object

Solution

  • Maybe something like this:

    db.collection.aggregate([
    {
      "$group": {
      "_id": {
        k: "$organization",
        v: "$val"
       },
      "cnt": {
        $sum: 1
       }
      }
     },
     {
      $project: {
        _id: 0,
        k: "$_id.k",
        o: {
          k: "$_id.v",
          v: "$cnt"
        }
       }
      },
     {
       $group: {
        _id: "$k",
        v: {
          $push: "$o"
        }
      }
    },
    {
      $addFields: {
        v: {
          "$arrayToObject": "$v"
        }
      }
    },
    {
      $project: {
        _id: 0,
        new: [
          {
            k: "$_id",
            v: "$v"
          }
        ]
      }
    },
    {
      "$addFields": {
        "new": {
          "$arrayToObject": "$new"
        }
      }
    },
    {
     "$replaceRoot": {
       "newRoot": "$new"
     }
    }
    ])
    

    Explained:

    1. Group to count
    2. Project for arrayToObject
    3. Group to join the values
    4. arrayToObject one more time
    5. project additionally
    6. arrayToObject to form the final object
    7. project one more time
    8. replaceRoot to move the object to root.

    P.S. Please, note this solution is not showing the missing values if they do not exist , if you need the missing values additional mapping / mergeObjects need to be added

    playground1

    Option with missing values ( if possible values are fixed to null,valid,invalid) : just replace the second addFiedlds with:

       {
       $addFields: {
         v: {
        "$mergeObjects": [
          {
            "null": 0,
            valid: 0,
            invalid: 0
          },
          {
            "$arrayToObject": "$v"
          }
        ]
       }
      }
     }
    

    playground2

    ++url:

    playground3