Search code examples
mongodbaggregation-framework

Is $push+$reduce really slower than a custom accumulator?


Let's say I have a few million documents in a MongoDB collection that look like this:

[
  {
    "timestamp": "2024-01-01T00:00:00Z",
    "schema": "1.0.0",
    "value": 3,
  },
  {
    "timestamp": "2024-01-01T01:00:00Z",
    "schema": "1.2.0",
    "value": -10,
  },
  ...
]

Now, I want to do the following, using an aggregation pipeline:

So, the desired output is something like:

[
    // January bucket
  {
    "bucket": "2024-01-01T00:00:00Z",
    "value": {
      "timestamp": "2024-01-01T01:00:00Z",
      "schema": "1.2.0",
      "absMax": -10
    }
  }
]

Obviously, the default $max accumulator does not work, as it has two problems:

  • It can not keep the original sign when querying absolute maxes
  • It does not include timestamp and schema, but only outputs the numeric value

So, I tackled the problem trying two different ways:

  1. In my $group stage, I use $push to push all raw documents into a $raw document, which I then go through with $reduce. I need the $raw document to have timestamp and schema always available.
  2. In my $group stage, I use a custom accumulator function (see https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/) that reduces each document and keeps the original timestamp and schema next to the absolute maximum in its state.

Now, I encounter the following issues:

  • Solution 1 runs into memory issues, as pushing millions of documents into RAM exceeds MongoDB's hard limit of 100MB of RAM per aggregation step
  • Solution 2 does look quite ugly in code and MongoDB advises against using custom accumulator functions written in JavaScript.

I updated my question with MongoDB playgrounds:

And for the sake of completeness using only $min and $max, but losing timestamp and schema: https://mongoplayground.net/p/UegNExWo2np

Solution 2 is about twice as fast as solution 1 on big data sets.

Am I overlooking something?


Solution

  • You can group easier with $dateTrunc. For min and max value you can use $bottom or $first

    db.collection.aggregate([
      { $set: { absValue: { $abs: "$value" } } },
      {
        $group: {
          _id: {
            $dateTrunc: {
              date: "$timestamp",
              unit: "month",
              timezone: "Europe/Zurich"
            }
          },
          maxValue: {
            $bottom: {
              sortBy: { absValue: 1 },
              output: {
                timestamp: "$timestamp",
                schema: "$schema",
                value: "$value"
              }
            }
          }
        }
      }
    ])
    

    For the minValue use either $top or change the sort order to sortBy: { absValue: -1 }

    If you like to use $first and $last then you need to sort the entire collection, i.e.

    db.collection.aggregate([
      { $set: { absValue: { $abs: "$value" } } },
      { $sort: { absValue: 1 } },
      {
        "$group": {
          "_id": {
            $dateTrunc: {
              date: "$timestamp",
              unit: "month",
              timezone: "Europe/Zurich"
            }
          },
          value: {
            $last: {
              timestamp: "$timestamp",
              schema: "$schema",
              value: "$value"
            }
          }
        }
      }
    ])
    

    But I think this will be slower.

    The rest is just some cosmetic.