Search code examples
mongodbmongodb-queryprojection

Querying and sorting a large amount of data in mongo, by fields which might not exist


I'm relatively new to mongo and I have a collection that looks like this:

[
    {
        "stored": {
            "box": [
                {
                    "parcelId": "uwb1",
                    "status": "ACTIVE"
                }
            ]
        },
        "checked": {
            "box": [
                {
                    "parcelId": "uwb1",
                    "status": "ACTIVE"
                }
            ]
        }
    },
    {
        "stored": {
            "box": [
                {
                    "parcelId": "aqrf123",
                    "status": "PENDING"
                }
            ]
        },
        "checked": {
            "box": [
                {
                    "parcelId": "aqrf123",
                    "status": "PENDING"
                }
            ]
        }
    },
    {
        "checked": {
            "box": [
                {
                    "parcelId": "zuz873",
                    "status": "ACTIVE"
                }
            ]
        }
    }
]

Some observations regarding the data:

  • the document will always have the checked field but might not have the stored field
  • the checked and stored fields have the same schema
  • both will always have the box field and we can make the assumption that the box field will always have 1 element in the array (only 1, not more, not less)
  • the amount of documents in this collection is relatively high (~100 mil)

What I'm trying to achieve is to get the documents sorted by that status field, which is like an enum, and it can have 3 values - ACTIVE, PENDING and REJECTED.

  • If for a document, the stored field exist, I'll take it from there and disregard the checked field.
  • Otherwise I'm going to have to take it from the checked field, which is guaranteed to exist, as previously mentioned.
  • An important requirement is to have the entire document returned to the consumer / client, so I cannot use projection to reduce the amount of data from the documents (which would probably make the whole operation faster).

How I've tried to achieve this is by using an aggregation which looks like this:

db.getCollection('entries')
    .aggregate([{
            $addFields: {
                sortStatus: {
                    $ifNull: [{
                        $let: {
                            vars: {
                                box: {
                                    $arrayElemAt: [
                                        "$stored.box", 0
                                    ]
                                }
                            },
                            in: "$$box.status"
                        }
                    }, {
                        $let: {
                            vars: {
                                box: {
                                    $arrayElemAt: [
                                        "$checked.box", 0
                                    ]
                                }
                            },
                            in: "$$box.status"
                        }
                    }]
                }
            }
        },
        {
            $sort: {
                sortStatus: 1
            }
        }
    ], {
        allowDiskUse: true
    })

Which seems to do the job but it feels quite slow. Also there's the allowDiskUse which makes me uncomfortable a bit. If I leave it out, I get the Sort exceeded memory limit of x bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in error message.

So my questions are:

  1. Are there faster alternatives, be it with or without aggregation?
  2. Are there any risks in using the allowDiskUse option when doing an aggregation?
  3. Would it be better (or is it the "mongo" way) to alter a bit the document structure and add that sortable field to the root of the document, add an index for it and just use .sort({"statusField": 1})? This would be the last resort option, as I'd have to migrate the existing data.

Solution

  • Your sortStatus field value can be arrived at by using the following:

    { $addFields: { sortStatus: { $ifNull: [ "$stored.box.status", "$checked.box.status" ] } } },
    

    Will this make the query faster? no, but the code is simpler.


    (1) Are there faster alternatives, be it with or without aggregation?

    I don't know, at this momemt.


    (2) Are there any risks in using the allowDiskUse option when doing an aggregation?

    Using the allowDiskUse:true option means that the sort operation uses the disk for additional resources when the memory (RAM) for this operation exceeds its limit. The disk IO is very slow compared to the memory, so the "risk" is a much slower sort operation. This option becomes mandatory when the sort operation requires more memory than the limit of 100MB (see the documentation on Sort and Memory Restrictions in Aggregation).


    (3) Would it be better (or is it the "mongo" way) to alter a bit the document structure and add that sortable field to the root of the document, add an index for it and just use .sort({"statusField": 1})? This would be the last resort option, as I'd have to migrate the existing data.

    Creating a new status field and an index on this field means new considerations:

    • Creating a new field "status" requires that additional computation at the time of writing the document (possibly during updates too).
    • Creating an index on this new field, also is an additional overhead during writes. Note the index size will be large as the number of documents is high.

    These can affect the write performance of your application.

    But, the query will become a simple sort. With a large number of documents in the collection, the index used for sorting may or may not fit the memory during operation. You cannot be sure how this option can help without some real trials.

    Here is some documentaion on Indexing Strategies.