Search code examples
pythonmongodb-queryquery-optimizationpymongo

Improve MongoDB query in Python


I have a MongoDB database including hiking trails that I visualize in a Jupyter Notebook. Each trail has a start and destination and includes a timestamp when it was updated. So there are multiple entries for one start-destination pair.

I load all distinct trails in a list and iterate over it. For each trail I then want to find out

  1. how many updates there have been
  2. what the most recent date is that the trail was updated.

So I query the distinct timestamps and calculate their length like this:

nrUpdates = len(mycollection.distinct("mytimestamp", {"trailName": trail})

And to get the most recent update I then do:

mycollection.find({"trailName": trail}), {"mytimestamp":1, "_id":0}).sort("timestamp", -1).limit(1)

Is there any way to query the database in a more efficient way? I have to iterate several thousand trails and the performance is currently not high. Of course my queried fields are indexed.


Solution

  • If I understand what you are trying to do:

    1. Number of updates for each "trail" (number of docs for each "trail"?)
    2. Most recent update for each "trail" ("$max" "mytimestamp"?)

    ... and inferring your data model, this aggregation pipeline may be what you are looking for.

    db.mycollection.aggregate([
      {
        "$group": {
          "_id": "$trail",
          "updateCount": {
            "$count": {}
          },
          "mostRecentUpdate": {
            "$max": "$mytimestamp"
          }
        }
      }
    ])
    

    Example output (array of objects like this):

    [
      ...
      {
        "_id": "Trois Pistoles",
        "mostRecentUpdate": ISODate("2023-07-31T04:41:24Z"),
        "updateCount": 30
      },
      ...
    ]
    

    Try it on mongoplayground.net.