Tags: python, google-app-engine, pipeline

Can I keep state across GAE Pipeline API workers?


I've begun creating a MapReduce job with the new Google App Engine Pipeline API, and I've run into a situation where I'd like every worker to have a copy of the same list during runtime.

One option would be to use memcache, but I'm worried that the list might eventually grow larger than the biggest value memcache will let me store. I think my other option would be to initialize every worker with this list at runtime, but I can't find any way to do this in the docs, and looking at the source code hasn't offered any obvious answers.

Is there a way to pass extra parameters into a MapReduce function or otherwise inject state into a MapReduce worker context?


Solution

  • There's no official way to do this at the moment. You could probably work around it by prepending a task to the MapReduce pipeline that computes the list and stores it in the datastore or blobstore (whichever is more appropriate), with a copy in memcache. Then have your mapper and/or reducer function lazily initialize a global variable that holds the list: check memcache first, fall back on the datastore/blobstore on a miss, and re-cache the list. As new instances are spun up to handle tasks, they'll initialize themselves the same way.

    Assuming the list is fixed at the time the MapReduce starts, competing reads from different instances won't be an issue.
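
The lazy initialization described above could be sketched roughly as follows. This is a minimal illustration, not the MapReduce library's API: `DictCache` and `DurableStore` are hypothetical stand-ins so the example runs anywhere, and `LIST_KEY`, `get_shared_list`, and `map_record` are made-up names. On App Engine, the cache calls would presumably be `memcache.get` and `memcache.set`, and `load()` would read whatever the prepended task wrote to the datastore or blobstore.

```python
# Sketch only. DictCache and DurableStore are hypothetical stand-ins for
# memcache and the datastore/blobstore copy written by the prepended task.

LIST_KEY = 'mapreduce_shared_list'  # hypothetical cache key


class DictCache(object):
    """Stand-in for memcache: get() returns None on a miss."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True


class DurableStore(object):
    """Stand-in for the durable copy of the list."""

    def __init__(self, items):
        self._items = list(items)
        self.reads = 0  # counts how often the slow path is taken

    def load(self):
        self.reads += 1
        return list(self._items)


CACHE = DictCache()
STORE = DurableStore(['alpha', 'beta'])

# Module-level global: each instance keeps its own copy once loaded.
_shared_list = None


def get_shared_list():
    """Return the list, loading it on first use in this instance."""
    global _shared_list
    if _shared_list is None:
        value = CACHE.get(LIST_KEY)       # fast path: shared cache
        if value is None:
            value = STORE.load()          # slow path: durable copy
            CACHE.set(LIST_KEY, value)    # re-cache for other instances
        _shared_list = value
    return _shared_list


def map_record(record):
    """Hypothetical mapper: emit only records that appear in the list."""
    if record in get_shared_list():
        yield record
```

Calling `map_record` repeatedly in one process reads the durable copy at most once; later calls reuse the global. A freshly started instance would begin with `_shared_list` set to `None` and repeat the same lookup, usually hitting the cache instead of the durable store.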