I have a collection containing documents with nested arrays: about 1 million documents, each with thousands of objects in the subfield array. The documents are quite large, but for the purpose of explanation consider the following two documents:
[
  {
    "id": "myid",
    "field": {
      "total": 1,
      "subfield": [
        {
          "somefield": "1000",
          "time": "2020-08-06T08:33:57.977+0530",
          "val": [
            {
              "x": "someval",
              "a": "val1",
              "b": "val2"
            }
          ]
        },
        {
          "somefield": "2000",
          "time": "2020-05-08T04:13:27.977+0530",
          "val": [
            {
              "x": "someval2",
              "a": "val1",
              "b": "val2"
            }
          ]
        }
      ]
    }
  },
  {
    "id": "myid2",
    "field": {
      "total": 1,
      "subfield": [
        {
          "somefield": "1001",
          "time": "2020-07-31T10:15:50.184+0530",
          "val": [
            {
              "x": "someval2",
              "a": "val1",
              "b": "val2"
            },
            {
              "x": "someval2",
              "a": "val1",
              "b": "val2"
            }
          ]
        }
      ]
    }
  }
]
Use case:
I need to project only the id of documents whose time (compared by date) is greater than a given value, and where field.subfield.val.a or field.subfield.val.b equals a particular value.
I have a query that achieves this using the $unwind, $toDate and $dateToString operators.
But using $unwind on such large arrays makes the pipeline consume a lot of memory and slows everything down: it currently takes more than 15 minutes. (I have not added any indexes, because even when I create an index for created_at, explain on the aggregation shows the winning plan does not use it.)
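An aside on why the index goes unused (my observation, not established from the post alone): $expr compares a value that $dateToString computes per document, which leaves the planner nothing to bound against an index. Since the stored time values are zero-padded ISO-8601 strings with a single fixed offset, plain lexicographic comparison already agrees with chronological order, so a plain string-range $match on field.subfield.time should be able to use an ordinary (multikey) index on that path. A quick JS check of the ordering property, using the sample timestamps:

```javascript
// Zero-padded ISO-8601 timestamps with one fixed offset compare
// chronologically as plain strings, so a string-range match on the raw
// field can stand in for the computed $dateToString comparison.
const times = [
  "2020-08-06T08:33:57.977+0530",
  "2020-05-08T04:13:27.977+0530",
  "2020-07-31T10:15:50.184+0530"
];
const cutoff = "2020-04-28";
const after = times.filter(t => t > cutoff).sort(); // lexicographic
// after is sorted lexicographically AND chronologically:
// ["2020-05-08...", "2020-07-31...", "2020-08-06..."]
```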
My current query:
db.collection.aggregate([
  {
    $unwind: { path: "$field.subfield" }
  },
  {
    $unwind: { path: "$field.subfield.val" }
  },
  {
    $addFields: {
      created_at: { $toDate: "$field.subfield.time" }
    }
  },
  {
    $match: {
      $and: [
        {
          $expr: {
            $gt: [
              { "$dateToString": { "date": "$created_at", "format": "%Y-%m-%d" } },
              "2020-04-28"
            ]
          }
        },
        {
          $or: [
            { "field.subfield.val.a": { "$eq": "val1" } },
            { "field.subfield.val.b": { "$eq": "val1" } }
          ]
        }
      ]
    }
  },
  {
    $group: { _id: "$id" }
  }
])
I need to limit the execution time of this query to less than 30 seconds, and I hope it can be made faster by avoiding $unwind.
My MongoDB server version is 4.0.3.
What other optimisations are possible?
Thanks!
Possible optimisations:
$unwind slows the query down; $filter can select the matching elements from the nested array instead, without flattening each document into thousands of intermediate ones. One caveat: time is stored as a string, so it must be converted with $toDate before comparing it to a Date value, because in BSON comparison order a string never sorts greater than a Date.
db.collection.aggregate([
  {
    "$project": {
      "id": 1,
      "obj1": {
        "$filter": {
          "input": "$field.subfield",
          "as": "el",
          "cond": {
            "$gt": [
              { "$toDate": "$$el.time" },
              new Date("2020-04-29")
            ]
          }
        }
      }
    }
  },
  {
    "$match": {
      "$or": [
        { "obj1.val.a": { "$eq": "val1" } },
        { "obj1.val.b": { "$eq": "val1" } }
      ]
    }
  },
  {
    "$project": {
      "id": 1,
      "_id": 0
    }
  }
])
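To sanity-check the $filter/$match semantics outside the server, the same logic can be reproduced in plain JavaScript over the question's two sample documents (a sketch, not server code; the +0530 offset is normalised to +05:30 so that Date can parse it as ISO-8601):

```javascript
// Plain-JS equivalent of the $filter + $match + final $project stages,
// run over the question's two sample documents.
const docs = [
  { id: "myid", field: { total: 1, subfield: [
    { somefield: "1000", time: "2020-08-06T08:33:57.977+0530",
      val: [{ x: "someval", a: "val1", b: "val2" }] },
    { somefield: "2000", time: "2020-05-08T04:13:27.977+0530",
      val: [{ x: "someval2", a: "val1", b: "val2" }] }
  ] } },
  { id: "myid2", field: { total: 1, subfield: [
    { somefield: "1001", time: "2020-07-31T10:15:50.184+0530",
      val: [{ x: "someval2", a: "val1", b: "val2" },
            { x: "someval2", a: "val1", b: "val2" }] }
  ] } }
];

// Insert a colon into the offset (+0530 -> +05:30) so Date() parses ISO-8601.
const parseTime = t => new Date(t.replace(/([+-]\d{2})(\d{2})$/, "$1:$2"));
const cutoff = new Date("2020-04-29");

const ids = docs
  // $filter stage: keep only subfield entries newer than the cutoff
  .map(d => ({ id: d.id,
               obj1: d.field.subfield.filter(el => parseTime(el.time) > cutoff) }))
  // $match stage: some surviving entry has val.a or val.b equal to "val1"
  .filter(d => d.obj1.some(el => el.val.some(v => v.a === "val1" || v.b === "val1")))
  // final $project: keep only id
  .map(d => ({ id: d.id }));
// ids: [{ id: "myid" }, { id: "myid2" }]
```

Both sample documents survive here because every subfield entry is newer than the cutoff and each carries a val element with a: "val1".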