I got the following documents of a collection (let's name it myCollection
):
{
"_id": {
"$oid": "601a75a0c9a338f09f238816"
},
"Sample": "lie50",
"Chromosome": "chr10",
"Position": {
"$numberLong": "47663"
},
"Reference": "C",
"Mutation": "T",
"Run": "Run_test",
"SYMBOL": "TUBB8"
},
{
"_id": {
"$oid": "601a75a0c9a338f09f238817"
},
"Sample": "lie50",
"Chromosome": "chr10",
"Position": {
"$numberLong": "47876"
},
"Reference": "T",
"Mutation": "C",
"Run": "Run_test",
"SYMBOL": "TUBB8"
},
{
"_id": {
"$oid": "601a75a0c9a338f09f238818"
},
"Sample": "lie50",
"Chromosome": "chr10",
"Position": {
"$numberLong": "48005"
},
"Reference": "G",
"Mutation": "A",
"Run": "Run_test",
"SYMBOL": "TUBB8"
},
{
"_id": {
"$oid": "601a75a0c9a338f09f238819"
},
"Sample": "lie12",
"Chromosome": "chr10",
"Position": {
"$numberLong": "48005"
},
"Reference": "G",
"Mutation": "A",
"Run": "Run_test",
"SYMBOL": "TUBB8"
}
I am interested in printing the distinct count of the values from the fields Chromosome
, Position
, Reference
, and Mutation
. This means to count the unique fields of the following entries:
"Chromosome": "chr10", "Position": 47663, "Reference": "C", "Mutation": "T"
"Chromosome": "chr10", "Position": 47876, "Reference": "T", "Mutation": "C"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"
which should be 3
distinct rows.
I have checked multiple questions like this one on how to print the distinct values for one field or using $unwind/$project.
For the latter, I thought why not concatenate the 4 fields and then print the number using length
with $unwind/$project
?
I managed to get that far:
db.myCollection.aggregate(
[
{
$group:
{
_id: null,
newfield: {
$addToSet:
{
$concat:
[
"$Chromosome",
"_",
{"$toString":"$Position"},
"_",
"$Reference",
"_",
"$Mutation"
]
}
}
}
},
{
$unwind: "$newfield"
},
{
$project: { _id: 0 }
}
]).length
However, adding .length
to this query does not return anything but without returns:
{ "newfield" : "chr10_47663_C_T" }
{ "newfield" : "chr10_47876_T_C" }
{ "newfield" : "chr10_48005_G_A" }
For information, my actual data contains 2 billion documents.
The fields should pass in _id
in $group
stage, and also use $count
stage to get total elements instead of returning all documents,
db.myCollection.aggregate([
{
$group: {
_id: {
Chromosome: "$Chromosome",
Position: "$Position",
Reference: "$Reference",
Mutation: "$Mutation"
}
}
},
{ $count: "count" }
])