Search code examples
pythonmongodbmatchpipeline

What happens to a $match term in a pipeline?


I'm a newbie to MongoDB and Python scripts. I'm confused how a $match term is handled in a pipeline.

Let's say I manage a library, where books are tracked as JSON files in a MongoDB. There is one JSON for each copy of a book. The book.JSON files look like this:

{
    "Title": "A Tale of Two Cities",
    "subData":
        {
            "status": "Checked In"
            ...more data here...
        }
}

Here, status will be one string from a finite set of strings, perhaps just: { "Checked In", "Checked Out", "Missing", etc. } But also note also that there may not be a status field at all:

{
    "Title": "Great Expectations",
    "subData":
        {
            ...more data here...
        }
}

Okay: I am trying to write a MongoDB pipeline within a Python script that does the following:

  • For each book in the library:
    • Groups and counts the different instances of the status field

So my target output from my Python script would be something like this:

{ "A Tale of Two Cities"   'Checked In'    3 }
{ "A Tale of Two Cities"   'Checked Out'   4 }
{ "Great Expectations"     'Checked In'    5 }
{ "Great Expectations"     ''    7 }

Here's my code:

mydatabase = client.JSON_DB
mycollection = mydatabase.JSON_all_2

listOfBooks = mycollection.distinct("bookname")
for book in listOfBooks:
    match_variable = {
        "$match": { 'Title': book }
    }
    group_variable = {
        "$group":{
            '_id': '$subdata.status',
            'categories' : { '$addToSet' : '$subdata.status' },
            'count': { '$sum': 1 }
        }
    }
    project_variable = {
        "$project": {
            '_id': 0,
            'categories' : 1,
            'count' : 1
        }
    }
    pipeline = [
        match_variable,
        group_variable,
        project_variable
    ]
    results = mycollection.aggregate(pipeline)
    for result in results:
        print(str(result['Title'])+"  "+str(result['categories'])+"  "+str(result['count']))

As you can probably tell, I have very little idea what I'm doing. When I run the code, I get an error because I'm trying to reference my $match term:

Traceback (most recent call last):
  File "testScript.py", line 34, in main
    print(str(result['Title'])+"  "+str(result['categories'])+"  "+str(result['count']))
KeyError: 'Title'

So a $match term is not included in the pipeline? Or am I not including it in the group_variable or project_variable ?

And on a general note, the above seems like a lot of code to do something relatively easy. Does anyone see a better way? Its easy to find simple examples online, but this is one step of complexity away from anything I can locate. Thank you.


Solution

  • Here's one aggregation pipeline to "$group" all the books by "Title" and "subData.status".

    db.collection.aggregate([
      {
        "$group": {
          "_id": {
            "Title": "$Title",
            "status": {"$ifNull": ["$subData.status", ""]}
          },
          "count": { "$count": {} }
        }
      },
      { // not really necessary, but puts output in predictable order
        "$sort": {
          "_id.Title": 1,
          "_id.status": 1
        }
      },
      {
        "$replaceWith": {
          "$mergeObjects": [
            "$_id",
            {"count": "$count"}
          ]
        }
      }
    ])
    

    Example output for one of the "books":

      {
        "Title": "mumblecore",
        "count": 3,
        "status": ""
      },
      {
        "Title": "mumblecore",
        "count": 3,
        "status": "Checked In"
      },
      {
        "Title": "mumblecore",
        "count": 8,
        "status": "Checked Out"
      },
      {
        "Title": "mumblecore",
        "count": 6,
        "status": "Missing"
      }
    

    Try it on mongoplayground.net.