Tags: python, mongodb, pymongo, stanford-nlp

How to store Stanza Span in MongoDB collection?


I am trying to add a list of dictionaries (named stanzanerlist) like the following:

stanzanerlist = [{
  "text": "Harry Potter",
  "type": "PER",
  "start_char": 141,
  "end_char": 153
}, {
  "text": "Hogwarts",
  "type": "LOC",
  "start_char": 405,
  "end_char": 413
}, {
  "text": "JK Rowling",
  "type": "PER",
  "start_char": 505,
  "end_char": 515
}]

as a field in a MongoDB document in a collection.

I am inserting the whole document as follows with stanzanerlist as the last item in mongodocument:

mongodocument = {
    "_id": urlid,
    "source": sourcename,
    "stanzadoc": stanzadoc.to_serialized(),
    "stanzaver": stanzaver,
    # "timestamp": datetime.now(tzinfo),
    "timestamp": datetime.now(
        tz=pytz.timezone(cfgdata["timezone"]["name"])
    ),
    "stanzanerlist": stanzanerlist,
}
try:
    # insert fails if URL/_ID already exists
    mdbrc = mdbcoll.insert_one(mongodocument)
    return mdbrc
except pymongo.errors.DuplicateKeyError:
    # manage the record update
    print(f"Article {urlid} already exists!")

but while all other fields work well, the addition of stanzanerlist gives the following error:

cannot encode object: {
  "text": "Harry Potter",
  "type": "PER",
  "start_char": 141,
  "end_char": 153
}, of type: <class 'stanza.models.common.doc.Span'>

and I can't work out whether, or how, I can store that list.


Solution

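  • pymongo doesn't natively know how to encode <class 'stanza.models.common.doc.Span'> objects as a BSON data type. Each item in stanzanerlist prints like a dictionary, but, as the error message shows, it is actually a Span and not a dict.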

    You could "teach" pymongo how to do the conversion by writing a custom bson.codec_options.TypeEncoder and registering it in the codec options of your collection; pymongo would then encode every Span automatically, as it does for the types it already knows. Or, you could do the conversion yourself each time before storing a Span in your MongoDB collection.

    Fortunately, Stanford NLP Stanza has convenience methods for type conversions. <class 'stanza.models.common.doc.Span'> has a to_dict method that returns a plain Python dict, which pymongo does know how to encode.

    So, in your code snippet, you could change the mongodocument assignment of "stanzanerlist" to:

    "stanzanerlist": [stan.to_dict() for stan in stanzanerlist]
    

    ... and then each <class 'stanza.models.common.doc.Span'> will be converted to a plain dict before the insert, so pymongo should be able to store it.
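
If the same list may sometimes already contain plain dicts (for example on an update path), a slightly more defensive version of the same conversion could look like this. It is only a sketch: FakeSpan is a hypothetical stand-in for stanza.models.common.doc.Span so the example runs without Stanza, and to_plain is an illustrative helper name, not part of Stanza or pymongo.

```python
import json


class FakeSpan:
    """Hypothetical stand-in for stanza.models.common.doc.Span."""

    def __init__(self, text, type_, start_char, end_char):
        self.text = text
        self.type = type_
        self.start_char = start_char
        self.end_char = end_char

    def to_dict(self):
        return {
            "text": self.text,
            "type": self.type,
            "start_char": self.start_char,
            "end_char": self.end_char,
        }


def to_plain(entity):
    """Convert Span-like objects to dicts; pass anything else through unchanged."""
    return entity.to_dict() if hasattr(entity, "to_dict") else entity


stanzanerlist = [
    FakeSpan("Harry Potter", "PER", 141, 153),
    {"text": "Hogwarts", "type": "LOC", "start_char": 405, "end_char": 413},
]

plain_entities = [to_plain(entity) for entity in stanzanerlist]

# Quick sanity check: plain dicts of str/int values serialize without errors.
serialized = json.dumps(plain_entities)
```

The resulting plain_entities list can then be used as the value of "stanzanerlist" in mongodocument.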