Tags: mongodb, pymongo, tornado-motor

Finding number of inserted documents in a bulk insert with duplicate keys


I'm doing a bulk insert into a MongoDB database. I know that 99% of the inserted records will fail with a duplicate key error. After the insert, I would like to print how many new records were actually added to the database. All of this is done in Python through Motor, the Tornado MongoDB driver, though that probably doesn't matter much.

try:
    bulk_write_result = yield db.collections.probe.insert(dataarray, continue_on_error=True)
    nr_inserts = bulk_write_result["nInserted"]
except pymongo.errors.DuplicateKeyError as e:
    nr_inserts = ????  <--- what should I put here?

Since an exception was thrown, bulk_write_result is never assigned. Obviously I can (concurrency issues aside) count the full collection before and after the insert, but I don't like the extra round trips to the database just for a line in the logfile. So is there any way to discover how many records were actually inserted?


Solution

  • A regular insert with continue_on_error can't report the information you want. If you're on MongoDB 2.6 or later, however, we have a high-performance solution with good error reporting. Here's a complete example using Motor's BulkOperationBuilder:

    import pymongo.errors
    from tornado import gen
    from tornado.ioloop import IOLoop
    from motor import MotorClient
    
    db = MotorClient()
    dataarray = [{'_id': 0},
                 {'_id': 0},  # Duplicate.
                 {'_id': 1}]
    
    
    @gen.coroutine
    def my_insert():
        try:
            bulk = db.collections.probe.initialize_unordered_bulk_op()
    
            # Prepare the operation on the client.
            for doc in dataarray:
                bulk.insert(doc)
    
            # Send to the server all at once.
            bulk_write_result = yield bulk.execute()
            nr_inserts = bulk_write_result["nInserted"]
        except pymongo.errors.BulkWriteError as e:
            print(e)
            nr_inserts = e.details['nInserted']
    
        print('nr_inserts: %d' % nr_inserts)
    
    
    IOLoop.instance().run_sync(my_insert)
    

    Full documentation: http://motor.readthedocs.org/en/stable/examples/bulk.html

    Heed the warning in that documentation about poor bulk insert performance on MongoDB versions before 2.6! The bulk API still works there, but it requires a separate round trip per document. In 2.6+, the driver sends the whole operation to the server in one round trip, and the server reports back how many operations succeeded and how many failed.
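
Since the server reports both successes and failures, the same details dictionary can also tell you how many failures were duplicates. The following is a minimal sketch, not part of Motor or PyMongo: `summarize_bulk_details` is a hypothetical helper name, and it assumes the details dictionary carries an `nInserted` count and a `writeErrors` list whose entries have a numeric `code`, with 11000 being the usual duplicate key error code. The sample dictionary is hand-written to resemble what the three-document example above would produce. It is not captured server output.

```python
# Hypothetical helper, not part of PyMongo or Motor.
# Assumption: 11000 is the server's duplicate key error code (E11000).
DUPLICATE_KEY_CODE = 11000


def summarize_bulk_details(details):
    """Reduce a BulkWriteError-style details dict to simple counts."""
    write_errors = details.get("writeErrors", [])
    duplicates = sum(
        1 for err in write_errors if err.get("code") == DUPLICATE_KEY_CODE
    )
    return {
        "inserted": details.get("nInserted", 0),
        "duplicates": duplicates,
        # Anything that failed for a reason other than a duplicate key.
        "other_errors": len(write_errors) - duplicates,
    }


# Hand-written sample shaped like e.details for the example above:
# two distinct _id values inserted, one duplicate rejected.
sample_details = {
    "nInserted": 2,
    "writeErrors": [
        {"index": 1, "code": 11000, "errmsg": "E11000 duplicate key error"},
    ],
}

summary = summarize_bulk_details(sample_details)
print(summary)
```

In the except branch above, you would call it as `summarize_bulk_details(e.details)`. Logging the `other_errors` count separately helps make sure a genuine failure is not mistaken for an expected duplicate.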