Search code examples
mongodbcountdistinct

Mongodb print count of unique values from multiple fields


I got the following documents of a collection (let's name it myCollection):

{
    "_id": {
        "$oid": "601a75a0c9a338f09f238816"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "47663"
    },
    "Reference": "C",
    "Mutation": "T",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238817"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "47876"
    },
    "Reference": "T",
    "Mutation": "C",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238818"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "48005"
    },
    "Reference": "G",
    "Mutation": "A",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238819"
    },
    "Sample": "lie12",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "48005"
    },
    "Reference": "G",
    "Mutation": "A",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
}

I am interested in printing the distinct count of the values from the fields Chromosome, Position, Reference, and Mutation. This means to count the unique fields of the following entries:

"Chromosome": "chr10", "Position": 47663, "Reference": "C", "Mutation": "T"
"Chromosome": "chr10", "Position": 47876, "Reference": "T", "Mutation": "C"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"

which should be 3 distinct rows.

I have checked multiple questions like this one on how to print the distinct values for one field or using $unwind/$project.

For the latter, I thought why not concatenate the 4 fields and then print the number using length with $unwind/$project?

I managed to get that far:

db.myCollection.aggregate(
[
  {
    $group:
    {
      _id: null,
      newfield: {
        $addToSet:
        {
          $concat:
          [
            "$Chromosome",
            "_",
            {"$toString":"$Position"},
            "_",
            "$Reference",
            "_",
            "$Mutation"
          ]
        }
      }
    }
  },
  {
    $unwind: "$newfield"
  },
  { 
    $project: { _id: 0 }
  }
]).length

However, adding .length to this query does not return anything but without returns:

{ "newfield" : "chr10_47663_C_T" }
{ "newfield" : "chr10_47876_T_C" }
{ "newfield" : "chr10_48005_G_A" }

For information, my actual data contains 2 billion documents.


Solution

  • The fields should pass in _id in $group stage, and also use $count stage to get total elements instead of returning all documents,

    db.myCollection.aggregate([
      {
        $group: {
          _id: {
            Chromosome: "$Chromosome",
            Position: "$Position",
            Reference: "$Reference",
            Mutation: "$Mutation"
          }
        }
      },
      { $count: "count" }
    ])
    

    Playground