Tags: mongodb, mongodb-indexes

Big data query in MongoDB: aggregation, single index or compound index?


I am trying to speed up a query that I'm running against a collection that contains more than 10 million documents. An example document looks like this:

    {
        nMove: 2041242,
        typeMove: 'Sold',
        date: "2016-05-18T16:00:00Z",
        operation: 'output',
        origin: {
            id: '3234fds32fds42',
            name: 'Main storage',
        },
        products: [{
           id: '342fmdsff23324432',
           name: 'Product 1',
           price: 34,
           quantity: 9
        }],
    }

Now I have to query all the documents that $match a given 'product.id' or 'origin.id' (or both), and $sum the product.quantity values to get a total.

So I am running a query like this:

 movesModel.aggregate([
    {
        $match: {
            $expr: {
                $and: [
                    { $in: [req.params.idProduct, '$product.id'] },
                    { $eq: ['$origin.id', req.params.idOrigin] },
                ]
            }
        }
    },
    {
        $project: {
            _id: 0,
            outputs: {
                $sum: {
                    $cond: { if: { $eq: ['$operation', 'input'] }, then: '$product.quantity', else: 0 }
                }
            },
            inputs: {
                $sum: {
                    $cond: { if: { $eq: ['$operation', 'output'] }, then: '$product.quantity', else: 0 }
                }
            }
        }
    },
    {
        $group: {
            _id: '$_id',
            inputs: { $sum: '$inputs' },
            outputs: { $sum: '$outputs' }
        }
    },

]).then((result) => {
    res.json(result)
})

This query takes about one minute to resolve... Sometimes it $matches more than 200k documents... Considering that I don't need the whole data, just the sum of the quantities, I have some questions (I am a MongoDB noob):

  1. About indexes: I created a compound index, db.moves.createIndex({ 'origin.id': 1, 'product.id': 1 }). Is it correct? Should I change it?

  2. Is my query OK? Can I improve it?

  3. To prevent a query from matching 200k documents, I did something tricky: I added a field called 'date', and I take all the documents that match 'origin.id' and 'product.id' and are $gte a given date... but it takes the same time, even when it only matches 1 document...

  4. Finally, I think my whole problem is about indexes, so I tried to check my $indexStats... but it does not seem to work with my aggregation query.
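
This is roughly how I checked it (a sketch from the mongo shell):

    // per-index usage counters for the whole collection
    db.moves.aggregate([{ $indexStats: {} }])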

Any help is appreciated. Thank you.

////////////FULL PIPELINE////////////

In this case I have two more collections, called 'storages' and 'inventories'.

// storage examples
    {
        _id: '3234fds32fds42',
        name: 'Main storage',
        status: true
    }
    {
        _id: '32f32f32432sda',
        name: 'Other storage',
        status: true
    }

// inventories examples
    {
        _id: 'fvavcsa3a3aa3',
        date: '2020-01-01',
        storage: {
            _id: '3234fds32fds42',
            name: 'Main storage'
        },
        products: [{
            id: '342fmdsff23324432',
            name: 'Product 1',
        }],
    }

So this is why I was using $lookup: what I really need is to get all the moves that match each storage and product.

// I also added inventories to filter by date, to prevent matching tons of documents

So this is my query:

    storagesModel.aggregate([
        {
            $match: { status: true }
        },
        {
            $lookup: {
                from: 'inventories',
                as: 'inventory',
                let: { 'idStorage': '$_id' },
                pipeline: [
                    {
                        $match: {
                            $expr: {
                                $and: [
                                    { $eq: ['$storage._id', { $toString: '$$idStorage' }] },
                                    { $in: [req.params.id, '$products._id'] }
                                ]
                            }
                        }
                    },
                    {
                        $sort: { date: -1 } // TO TAKE THE LAST INVENTORY
                    },
                    {
                        $limit: 1
                    }
                ]
            }
        },
        { $unwind: { path: '$inventory', preserveNullAndEmptyArrays: true } }, // DECONSTRUCT THE ARRAY AND GET IT AS AN OBJECT
        {
            $lookup: {
                from: 'moves',
                as: 'moves',
                let: {
                    'idStorage': '$_id',
                    'date': '$inventory.date'
                },
                pipeline: [
                    {
                        $match: {
                            $expr: {
                                $and: [
                                    { $gte: ['$date', '$$date'] },
                                    { $eq: ['$origin.id', '$$idStorage'] },
                                    { $in: [req.params.idProduct, '$product.id'] },
                                ]
                            }
                        }
                    },
                    {
                        $project: {
                            _id: 0,
                            outputs: {
                                $sum: {
                                    $cond: { if: { $eq: ['$operation', 'input'] }, then: '$product.quantity', else: 0 }
                                }
                            },
                            inputs: {
                                $sum: {
                                    $cond: { if: { $eq: ['$operation', 'output'] }, then: '$product.quantity', else: 0 }
                                }
                            }
                        }
                    },
                    {
                        $group: {
                            _id: '$_id',
                            inputs: { $sum: '$inputs' },
                            outputs: { $sum: '$outputs' }
                        }
                    }
                ]
            }
        }
    ])

Summarizing...

What I need to get is the total of 'moves' that a product has in every storage, taking into account that there may be an 'inventory' that gives you a date, to avoid matching a lot of documents. That's why I start from the storagesModel and use a $lookup stage to get the last inventory, so I have $$date and $$idStorage; then I $lookup into the 'moves' collection. I know it is a heavy query, but I thought that, given a date and a proper compound index, it should be fast... yet even when I try to get the 'moves' of a product that has only a few 'moves', it takes 20 or 30 seconds...

I tried doing a single query without $lookup, and even matching 400k documents it takes 1-2 seconds...

What do you think? Thanks for your help.


Solution

  • Overall you did a good job. Let's go over your points and talk about them separately.

    1. Yes, this is great: a compound index is exactly what you need for this pipeline. I'm sure you've read about compound indexes before choosing to create one, so I won't go in depth into why this index is optimal, as it's trivial.

    2. Let's leave this one to the end.

    3. This is great. If you don't care about the entire data sample but just recent data, this is the way to go. Now, in order to utilise this field properly and speed up performance, you should drop the old index we discussed in part 1 and create a new compound index that includes this field as well: {date: -1, 'origin.id': 1, 'product.id': 1}. Notice we choose a descending index for date, as we want the most recent data; this will make it more efficient. (A shell sketch follows after this list.)

    As you must have read, compound index field order matters, so feel free to change the order of this one to match whichever queries you run most.

    4. Unlikely; however, the way Mongo generates index trees is not stable. By that I mean that if you created your index long ago and more data has come in since, you might benefit from dropping and re-building the index. With that said, I do not recommend this, as I feel any improvement would be somewhat minor in your case.

    2. So, back to your query. First, two things I want to ask: (a) you used $and in your $match, but from your worded description it seems that $or logic is more suited; this is a quick change, for you to make if needed. (b) Again, I'm not sure if it's a mistake or not, but you seem to have switched the inputs and outputs (the 'input' operation is summed under outputs and vice versa); if so, you should swap them.
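
    As a reference, here is a minimal shell sketch of the index change from point 3 (it also doubles as the re-build mentioned in point 4; the field names are taken from the question):

    // drop the old index and build the new date-first compound index
    db.moves.dropIndex({ 'origin.id': 1, 'product.id': 1 })
    db.moves.createIndex({ date: -1, 'origin.id': 1, 'product.id': 1 })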

    With that said, here is how I would re-write this query (spoiler: not much has changed):

    movesModel.aggregate([
        { // notice I'm using Mongo's dot notation; $expr is also fine, I'm not sure if there's an efficiency difference
            $match: {
                $and: [
                    {
                        $or: [
                            { "product.id": req.params.idProduct },
                            { "origin.id": req.params.idOrigin }
                        ]
                    },
                    {
                        date: { $gt: new Date("2020-01-01") }
                    }
                ]
            }
        },
        { // there's no need for the $project stage, as we can nest the conditions straight into the $group; again, this should
            // not cause performance changes. I also switched the conditions so 'input' is summed under inputs, and grouped on
            // null so you get a single document with the overall totals.
            $group: {
                _id: null,
                inputs: { $sum: { $cond: { if: { $eq: ['$operation', 'input'] }, then: '$product.quantity', else: 0 } } },
                outputs: { $sum: { $cond: { if: { $eq: ['$operation', 'output'] }, then: '$product.quantity', else: 0 } } }
            }
        },
    ])
    

    So, to recap: your pipeline is mostly optimal, and your suspicion that the issue is index-related is somewhat correct. Once you build the new index from part 3, there is going to be quite a large performance change.
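
    To confirm the new index is actually picked (and as a workaround for the $indexStats issue from point 4), you can ask for the aggregation's plan. A minimal sketch, assuming the mongo shell:

    // shows the winning plan and whether an IXSCAN over the new index is used
    db.moves.explain("executionStats").aggregate([ /* the pipeline above */ ])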

    Something to consider is that scale goes up: your database will (hopefully) keep growing. Your current solution is fine for now, but eventually it will cave under scale and performance will drop again. Two easy options that come to mind are:

    1. Preprocess: on every update or insert, maintain a pre-calculated collection that is updated by those actions and stores the required metrics (see the sketch below).

    2. Create a "current" collection that only contains recent data, and query that one.

    Both of these will obviously create some overhead; it's for you to choose if and when to implement them.
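
    For option 1, here is a minimal sketch of what that could look like; totalsModel and its field names are hypothetical, made up for illustration:

    // On every new move, atomically bump the running totals for that (storage, product) pair.
    // Assumes a single product per move, as in the example document.
    async function recordMove(move) {
        await movesModel.create(move); // the normal insert
        const p = move.products[0];
        await totalsModel.updateOne( // hypothetical pre-calculated collection
            { storageId: move.origin.id, productId: p.id },
            {
                $inc: {
                    inputs: move.operation === 'input' ? p.quantity : 0,
                    outputs: move.operation === 'output' ? p.quantity : 0
                }
            },
            { upsert: true } // create the totals document the first time a pair is seen
        );
    }

    Option 2 pairs naturally with a TTL index, e.g. db.movesCurrent.createIndex({ date: 1 }, { expireAfterSeconds: 7776000 }) (collection name made up here), so old documents age out automatically.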