Search code examples
mongodbaggregation-frameworkpymongostring-comparison

Why is Mongo string comparison reading equivalent strings as inequivalent? pymongo


Using pymongo, I'm searching for MongoDB documents in which the string value in one field is not equivalent to the string value in another field. I've tried $cmp, $strcasecmp, $ne, and $eq, but they all keep returning that strings that look equivalent to me are not equivalent.

The pipeline starts as a join between collections with documents like the following examples:

# test collection:
[
 {'_id': '611868136', 'doc_type': 'way'},
 {'_id': '5792648632', 'doc_type': 'node'},
 {'_id': '611868133', 'doc_type': 'node'},
 {'_id': '1', 'doc_type': 'node'}
]

# refer_docs collection:
[
 {'_id': '8483444',
  'refs': [{'ref': '611868136', 'ref_type': 'way'},
           {'ref': '5792648632', 'ref_type': 'node'},
           {'ref': '611868133', 'ref_type': 'way'}],
  'doc_type': 'relation'}
]

Now, here's the aggregation:

import pymongo
mongo_client = MongoClient('localhost:27017')
osm_db = mongo_client.osm
refer_docs_col = osm_db["refer_docs"]

pipeline = [
    { "$unwind" : "$refs" },
    {
        "$lookup" : {
            "from" : "test",
            "localField" : "refs.ref",
            "foreignField" : "_id",
            "as" : "ref_doc"
        }
    },
    { "$match" : { "ref_doc" : { "$ne" : [] } } },
    { "$unwind" : "$ref_doc"},
    { "$match" : { "refs.ref_type" : { "$ne" : "$ref_doc.doc_type" } } },
    { "$project" : { "_id" : 1, "refs" : 1, "ref_doc.doc_type" : 1, 
                     "cmp" : { "$cmp" : [ "refs.ref_type",
                                          "ref_doc.doc_type" ] } } },
    { "$limit" : 1 }
]
[doc for doc in refer_docs_col.aggregate(pipeline)]

I would expect the following document to be returned:

[{'_id': '467676638',
 'refs': {'ref': '611868133', 'ref_type': 'way'},
 'ref_doc': {'doc_type': 'node'},
 'cmp': 1}]

But, it also returns documents that look like this:

[{'_id': '8483444',
  'refs': {'ref': '611868136', 'ref_type': 'way'},
  'ref_doc': {'doc_type': 'way'},
  'cmp': 1},
 {'_id': '8483444',
  'refs': {'ref': '5792648632', 'ref_type': 'node'},
  'ref_doc': {'doc_type': 'node'},
  'cmp': 1},
 {'_id': '8483444',
  'refs': {'ref': '611868133', 'ref_type': 'way'},
  'ref_doc': {'doc_type': 'node'},
  'cmp': 1}]

Notice the compared fields ("refs.ref_type", "ref_doc.doc_type") clearly contain equivalent values in two cases(e.g. "node"), but the comparison operator returns 1 in all cases, indicating the first value is greater than the second. It should return 0, indicating the two values are equivalent.

Ack! Why??


Solution

  • I forgot a couple of dollar signs in the $cmp statement, so I was comparing the literal strings of the field names rather than the field values. So obvious, ugh.

    pipeline = [
        { "$unwind" : "$refs" },
        {
            "$lookup" : {
                "from" : "test",
                "localField" : "refs.ref",
                "foreignField" : "_id",
                "as" : "ref_doc"
            }
        },
        { "$match" : { "ref_doc" : { "$ne" : [] } } },
        { "$unwind" : "$ref_doc"},
    #     { "$match" : { "refs.ref_type" : { "$ne" : "$ref_doc.doc_type" } } },
        { "$project" : { "_id" : 1, "refs" : 1, "ref_doc.doc_type" : 1, 
                         "cmp" : { "$cmp" : [ "$refs.ref_type",
                                              "$ref_doc.doc_type" ] } } },
        { "$match" : { "cmp" : { "$ne" : 0 } } }
    ]
    [doc for doc in refer_docs_col.aggregate(pipeline)]