Aggregation pipeline "latest for all distinct id" is very slow, need to create proper indexes?

Considering the following aggregation pipeline code to return newest entry for all distinct "internal_id":

db.locations.aggregate({$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}})

This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:

db.locations.count()
1513671

So I guess there's something wrong with the indexes, however I tried to create many indexes and none of them made an improvement, currently I kept those two that were supposed to be enough imho: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.

MongoDB is NOT sharded, and running a 3 hosts replicaset running version 3.6.14.

MongoDB log show the following:

2020-05-30T12:21:18.598+0200 I COMMAND  [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677msms

Solution

Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:

{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}: go through these timestamp sorted documents and then group them by the id. Because everything is sorted by timestamp at this point (rather than id), it'll end up being a decent amount of work.

You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.

You want to force mongo to come up with a better query plan than that that uses the {internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.

If providing that hint doesn't work, one altenative would be to break this query into 2 parts that each uses an appropriate index.

Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).

If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.