Search code examples
mongodbaggregation-framework

Is $push+$reduce really slower than a custom accumulator?


Let's say I have a few million documents in a MongoDB collection that look like this:

[
  {
    "timestamp": "2024-01-01T00:00:00Z",
    "schema": "1.0.0",
    "value": 3,
  },
  {
    "timestamp": "2024-01-01T01:00:00Z",
    "schema": "1.2.0",
    "value": -10,
  },
  ...
]

Now, I want to do the following, using an aggregation pipeline:

So, the desired output is something like:

[
    // January bucket
  {
    "bucket": "2024-01-01T00:00:00Z",
    "value": {
      "timestamp": "2024-01-01T01:00:00Z",
      "schema": "1.2.0",
      "absMax": -10
    }
  }
]

Obviously, the default $max accumulator does not work, as it has two problems:

  • It can not keep the original sign when querying absolute maxes
  • It does not include timestamp and schema, but only outputs the numeric value

So, I tackled the problem trying two different ways:

  1. In my $group stage, I use $push to push all raw documents into a $raw document, which I then go through with $reduce. I need the $raw document to have timestamp and schema always available.
  2. In my $group stage, I use a custom accumulator function (see https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/) that reduces each document and keeps the original timestamp and schema next to the absolute maximum in its state.

Now, I encounter the following issues:

  • Solution 1 runs into memory issues, as pushing millions of documents into RAM exceeds MongoDB's hard limit of 100MB of RAM per aggregation step
  • Solution 2 does look quite ugly in code and MongoDB advises against using custom accumulator functions written in JavaScript.

I updated my question with MongoDB playgrounds:

And for the sake of completeness using only $min and $max, but losing timestamp and schema: https://mongoplayground.net/p/UegNExWo2np

Solution 2 is about twice as fast as solution 1 on big data sets.

Am I overlooking something?


Solution

  • Now after some time and research, I want to follow up and post our solution here. We ended up using the built-in $top accumulator (docs):

    {
       $top:
          {
             sortBy: { <field1>: <sort order>, <field2>: <sort order> ... },
             output: <expression>
          }
    }
    

    We used sortBy to sort by the absolute of the field's value, maintaining the original sign in the output. For us, this gave us significant performance increase over the original ideas I posted.