multithreading google-app-engine google-cloud-datastore cursors

Multithreading with cursors in Google Datastore

I want to load a lot of data from Google Datastore.

Step 1: I run the query with keysOnly=true and step through the results, recording a cursor at the start of each page of 600 objects. I store the cursors in a local variable.

Step 2: I spin off one thread per cursor, loading and processing 600 objects in each thread.
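Roughly, the two steps look like the sketch below. It is only a model: cursors are represented as integer offsets into an in-memory list, which real Datastore cursors are not, and the names CursorPerThreadSketch, collectPageStarts, and loadPage are hypothetical. loadPage stands in for "run the query from this cursor with a limit of 600".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: "cursors" are integer offsets into an in-memory list.
class CursorPerThreadSketch {
    static final int PAGE_SIZE = 600;

    // Step 1: walk the result set once and record where each page starts.
    static List<Integer> collectPageStarts(int totalResults) {
        List<Integer> starts = new ArrayList<>();
        for (int offset = 0; offset < totalResults; offset += PAGE_SIZE) {
            starts.add(offset);
        }
        return starts;
    }

    // Hypothetical stand-in for running the query from a cursor with limit 600.
    static List<Integer> loadPage(List<Integer> store, int start) {
        int end = Math.min(start + PAGE_SIZE, store.size());
        return new ArrayList<>(store.subList(start, end));
    }

    // Step 2: one thread per recorded cursor; pages are joined in page order.
    static List<Integer> loadAllInParallel(List<Integer> store) {
        List<Integer> starts = collectPageStarts(store.size());
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, starts.size()));
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int start : starts) {
                futures.add(pool.submit(() -> loadPage(store, start)));
            }
            List<Integer> all = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this model every page is loaded exactly once because an offset means the same thing in both steps. Whether a real cursor carries over in the same way depends on the two queries being the same query.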

This is not the usual way to use cursors.

However, it looks correct to me. The actual query strings in Step 1 and Step 2 are identical. This resembles the usual stateless web use-case where a user may ask for Next, Back, then reload a previous page; there is no need for a cursor to come directly from the result of the previous cursor-query.

I don't want to step through the cursors sequentially and only parallelize the processing of the objects each cursor-query returns. The querying itself is the IO-intensive part, so that is what I want to run in parallel.

I am getting inconsistent results: some pages appear to be missed and some objects are loaded more than once. Is this the correct way to multithread the loading of large amounts of data from Google Datastore? If not, what is?

Solution

Ed Davisson, a Google engineer who works on the Google Datastore Client API, answered this, giving both the root cause of the problem and a recommended solution.

He says:

"The cursors returned by a query are only valid for use in the same query. When you switch from the keys-only query [In my Step 1, JF] to the non-keys-only query [In my Step 2, JF], the cursors are no longer applicable....

"If your goal is to split a result set into similar sized chunks, you might want to take a look at QuerySplitter [which is now in version 1beta3, JF]."
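To illustrate the general idea of splitting a result set into chunks that can be loaded independently, here is a minimal sketch. It does not call QuerySplitter or any Datastore API: a TreeMap stands in for the datastore, and KeyRangeSplitSketch, KeyRange, split, and loadRange are hypothetical names. The assumption is that each chunk is described by a key range, so every worker runs its own self-contained range query and no cursor has to be shared between queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a sorted in-memory map stands in for the datastore.
class KeyRangeSplitSketch {
    // Half-open key range [from, to); null means unbounded on that side.
    static final class KeyRange {
        final Long from;
        final Long to;
        KeyRange(Long from, Long to) { this.from = from; this.to = to; }
    }

    // Choose boundary keys so the ranges hold roughly equal numbers of entries.
    static List<KeyRange> split(NavigableMap<Long, String> store, int numSplits) {
        List<Long> keys = new ArrayList<>(store.keySet());
        List<KeyRange> ranges = new ArrayList<>();
        Long previous = null;
        for (int i = 1; i < numSplits; i++) {
            int index = (int) ((long) i * keys.size() / numSplits);
            if (index <= 0 || index >= keys.size()) continue;
            Long boundary = keys.get(index);
            if (boundary.equals(previous)) continue; // skip empty ranges
            ranges.add(new KeyRange(previous, boundary));
            previous = boundary;
        }
        ranges.add(new KeyRange(previous, null)); // final range is open-ended
        return ranges;
    }

    // Hypothetical stand-in for a query filtered to one key range.
    static List<String> loadRange(NavigableMap<Long, String> store, KeyRange range) {
        NavigableMap<Long, String> view = store;
        if (range.from != null) view = view.tailMap(range.from, true);
        if (range.to != null) view = view.headMap(range.to, false);
        return new ArrayList<>(view.values());
    }

    // Load every range on its own thread and join the results in key order.
    // The map is only read here, never modified, while the workers run.
    static List<String> loadAll(NavigableMap<Long, String> store, int numSplits) {
        List<KeyRange> ranges = split(store, numSplits);
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                futures.add(pool.submit(() -> loadRange(store, range)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the ranges are half-open and adjacent, each entry falls into exactly one range, which avoids the missed and duplicated pages described in the question. The real QuerySplitter may choose its split points differently from the even spacing used here.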