Tags: mongodb, cursor, timeout, pymongo

Cursor not found when using parallel scan with PyMongo


I have a MongoDB collection of 3,000,000 documents that I process with PyMongo. I want to iterate once over all documents without updating the collection. I tried to do this with four threads, one per cursor returned by parallel_scan:

import threading

CURSORS_NUM = 4
cursors = db[collection].parallel_scan(CURSORS_NUM)
threads = [
    threading.Thread(target=process_cursor, args=(cursor, )) for cursor in cursors
]

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

And the process_cursor function:

def process_cursor(cursor):
    for document in cursor:
        dosomething(document)
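For reference, the fan-out pattern above can be exercised without a database by substituting plain Python iterables for the parallel-scan cursors (the four lists and the `_id`-collecting body below are stand-ins for the real cursors and for dosomething):

```python
import threading

def process_cursor(cursor, results, lock):
    # Iterate the "cursor" (here any iterable of documents) and record each _id.
    for document in cursor:
        with lock:
            results.append(document["_id"])

# Stand-in for db[collection].parallel_scan(4): four disjoint batches of documents.
cursors = [[{"_id": i} for i in range(start, start + 5)] for start in (0, 5, 10, 15)]

results = []
lock = threading.Lock()
threads = [
    threading.Thread(target=process_cursor, args=(cursor, results, lock))
    for cursor in cursors
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After joining, `results` contains every document exactly once; the lock is needed because the threads append to a shared list.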

After some time of processing documents I receive the error:

  File "extendDocuments.py", line 133, in process_cursor
    for document in cursor:
  File "/usr/local/lib/python2.7/dist-packages/pymongo/command_cursor.py", line 165, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib/python2.7/dist-packages/pymongo/command_cursor.py", line 142, in _refresh
    self.__batch_size, self.__id))
  File "/usr/local/lib/python2.7/dist-packages/pymongo/command_cursor.py", line 110, in __send_message
    *self.__decode_opts)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/helpers.py", line 97, in _unpack_response
    cursor_id)
CursorNotFound: cursor id '116893918402' not valid at server

If I use find() instead, I can set timeout=False to avoid this. Can I do something similar with the cursors I get from parallel_scan?
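As an aside, the spelling of that find() option depends on the PyMongo version: 2.x used timeout=False, while 3.x+ renamed it to no_cursor_timeout=True. A small helper (my own sketch, not part of PyMongo) makes the mapping explicit:

```python
def no_timeout_kwargs(pymongo_major_version):
    """Return the find() keyword argument that disables the server's
    ~10-minute idle cursor timeout for the given PyMongo major version."""
    if pymongo_major_version >= 3:
        return {"no_cursor_timeout": True}
    return {"timeout": False}

# Usage (requires a live MongoDB connection, sketched only):
# cursor = db[collection].find({}, **no_timeout_kwargs(3))
# for document in cursor:
#     dosomething(document)
```

Note that a cursor opened with the timeout disabled should always be closed explicitly, since the server will no longer reap it on its own.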


Solution

  • Currently there's no way to turn off the idle timeout for the cursors returned from parallelCollectionScan. I've opened a feature request:

    https://jira.mongodb.org/browse/SERVER-15042
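Until that lands, one possible workaround (my own sketch, not from the answer) is to skip parallel_scan entirely: partition the collection by _id range yourself and give each thread its own find() cursor with the timeout disabled. The range-splitting helper below is plain Python and assumes integer _ids; the PyMongo calls are shown as comments:

```python
def split_ranges(min_id, max_id, n):
    """Split the inclusive [min_id, max_id] integer range into n contiguous
    half-open (lo, hi) ranges; the last range absorbs any remainder."""
    step = (max_id - min_id + 1) // n
    bounds = [min_id + i * step for i in range(n)] + [max_id + 1]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]

# Each thread would then run something like (requires a live MongoDB connection):
# lo, hi = ranges[thread_index]
# cursor = db[collection].find({"_id": {"$gte": lo, "$lt": hi}},
#                              no_cursor_timeout=True)
# try:
#     for document in cursor:
#         dosomething(document)
# finally:
#     cursor.close()  # always close no-timeout cursors explicitly
```

Because each thread reads a disjoint _id range, every document is still visited exactly once, and a cursor killed mid-scan can be resumed from the last _id seen.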