I have a MongoDB database including hiking trails that I visualize in a Jupyter Notebook. Each trail has a start and destination and includes a timestamp when it was updated. So there are multiple entries for one start-destination pair.
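For illustration, a single document looks roughly like this as a Python dict (trailName and mytimestamp are the field names used in my queries below; start and destination are hypothetical names for the other fields):

{
    "trailName": "Trois Pistoles",
    "start": "...",                # hypothetical field name
    "destination": "...",          # hypothetical field name
    "mytimestamp": datetime.datetime(2023, 7, 31, 4, 41, 24),
}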
I load all distinct trail names into a list and iterate over it. For each trail I then want to find out how many updates exist and when the most recent one was.
So I query the distinct timestamps and count them like this:
nrUpdates = len(mycollection.distinct("mytimestamp", {"trailName": trail}))
And to get the most recent update I then do:
mycollection.find({"trailName": trail}, {"mytimestamp": 1, "_id": 0}).sort("mytimestamp", -1).limit(1)
Is there a more efficient way to query the database? I have to iterate over several thousand trails and the performance is currently poor. Of course, the queried fields are indexed.
If I understand what you are trying to do (for each "trailName", the number of docs and the "$max" of "mytimestamp"?), and inferring your data model, this aggregation pipeline may be what you are looking for:
db.mycollection.aggregate([
  {
    "$group": {
      // one result document per distinct trail
      "_id": "$trailName",
      // number of update documents for this trail
      "updateCount": { "$count": {} },
      // most recent update timestamp for this trail
      "mostRecentUpdate": { "$max": "$mytimestamp" }
    }
  }
])
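Note: the "$count" accumulator inside "$group" requires MongoDB 5.0 or newer; on older servers, "updateCount": {"$sum": 1} does the same thing. Also, "$count" counts documents rather than distinct timestamps; if those can differ in your data, collect {"$addToSet": "$mytimestamp"} in the "$group" and take its "$size" in a following "$project" stage to match your original distinct() logic.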
Example output (array of objects like this):
[
  ...
  {
    "_id": "Trois Pistoles",
    "mostRecentUpdate": ISODate("2023-07-31T04:41:24Z"),
    "updateCount": 30
  },
  ...
]
Try it on mongoplayground.net.
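Since you are querying from a Jupyter Notebook, the same pipeline through pymongo might look like this (a sketch assuming mycollection is the pymongo collection handle from your snippets):

pipeline = [
    {
        "$group": {
            "_id": "$trailName",                           # one result per trail
            "updateCount": {"$count": {}},                 # docs per trail
            "mostRecentUpdate": {"$max": "$mytimestamp"},  # latest timestamp
        }
    }
]

# One round trip replaces the two queries you currently issue per trail.
for doc in mycollection.aggregate(pipeline):
    print(doc["_id"], doc["updateCount"], doc["mostRecentUpdate"])

A single aggregation like this should remove most of the per-trail round-trip overhead, which is usually what dominates when you issue thousands of small queries.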