I have a mongo aggregate query like below:
db.someCollection.aggregate([
{
$match: { taskId: "qy7u17-xunwqu" }
},
// Group by "tracklet_id" and calculate count for each group
{
$group: {
_id: '$tracklet_id',
count: { $sum: 1 },
representativeImage: { $first: '$img' }, // when I remove this, the query is done in a split second
timestamp: { $max: '$timestamp' },
},
},
{
$project: {
_id: 0,
trackletId: '$_id',
image: '$representativeImage', // but at the end, I want one representative image for a tracklet, doesn't matter which one.
timestamp: 1,
count: 1,
},
},
{
$sort: {
timestamp: -1
}
},
{
$limit: 20
},
], {allowDiskUse: true})
the image field contains long b64 strings that take a lot of memory, causing the group and sort stage of the query to move to disk. Is there a way that I can add a pipeline step before project to re-include an image field for each tracklet?
One alternative I can think of is to just do a separate query afterwards to get the images and combine the results, but I'm hoping there is a more elegant way to do it in the same aggregate query.
As verified by the OP.
Instead of having the image in the $group
stage get the image from a self-lookup where you limit only to 1 doc (since we only need any single image and to save memory) that matches the same trackletid
db.someCollection.aggregate([
{ $match: { taskId: "qy7u17-xunwqu" } },
{
$group: {
_id: "$tracklet_id",
count: { $sum: 1 },
timestamp: { $max: "$timestamp" }
}
},
{ $project: { _id: 0, trackletId: "$_id", timestamp: 1, count: 1 } },
{ $sort: { timestamp: -1 } },
{ $limit: 20 },
{
$lookup: {
from: "someCollection",
let: { trackletId: "$trackletId" },
pipeline: [
{ $match: { $expr: { $eq: [ "$tracklet_id", "$$trackletId" ] } } },
{ $limit: 1 },
{ $project: { _id: 0, img: 1 } }
],
as: "image"
}
},
{ $addFields: { image: { $arrayElemAt: [ "$image.img", 0 ] } } }
])