Tags: google-app-engine, google-cloud-storage, deferred

Writing to a File using App Engine Deferred


I have a task that I would like to kick off using App Engine's cron job scheduler. To build the handler for this task, I've been following an App Engine article that describes how to use the deferred library to ensure that long-running tasks don't time out.

Note that the article discusses deferred tasks in the context of updating model entities. However, I would like to use the same approach to continuously write to a file that will be hosted on Google Cloud Storage (GCS).

To adapt the approach, I had thought to pass the file stream that I am working with instead of the Cursor object, as they do in the UpdateSchema definition in the article. However, in production (with 10k+ entries to write), I imagine this file/file stream will be too big to pass around.

As such, I'm wondering whether it would be a better idea to write a portion of the file, save it to GCS, and then retrieve it when the deferred task runs again, write to it, save it, and so on -- or do something else entirely. I'm not quite sure what is typically done to accomplish App Engine tasks like this (i.e., where the input location is the datastore, but the output location is somewhere else).

Edit: if it makes a difference, I'm using Python.


Solution

  • I suspect that the file stream will be closed before your next task gets it, and that it won't work.

    You can certainly do the following:

    1. Pass the GCS filename to the task
    2. Read in the whole file.
    3. Create a new file that has the old data and whatever new data you want to add.

    Note that you can't append to a file in GCS, so you have to read in the whole file and then rewrite it, as in the sketch below.
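
    Something like the following might work, assuming the GoogleAppEngineCloudStorageClient library (imported as `cloudstorage`). The model, field name, bucket path, and batch size here are all placeholders for your own setup:

    ```python
    import cloudstorage as gcs
    from google.appengine.ext import deferred, ndb

    BUCKET_FILE = '/my-bucket/output.txt'  # hypothetical GCS object path
    BATCH_SIZE = 100                       # illustrative; tune to your data

    class MyModel(ndb.Model):
        # Stand-in for whatever entity kind you're exporting.
        some_field = ndb.StringProperty()

    def append_batch(filename, cursor=None):
        # Fetch the next batch of entities, resuming from the cursor.
        entities, next_cursor, more = MyModel.query().fetch_page(
            BATCH_SIZE, start_cursor=cursor)

        # GCS objects can't be appended to, so read the existing contents...
        try:
            with gcs.open(filename, 'r') as f:
                existing = f.read()
        except gcs.NotFoundError:
            existing = ''  # first run: the object doesn't exist yet

        # ...and rewrite the whole object with the new rows added on the end.
        new_rows = ''.join('%s\n' % e.some_field for e in entities)
        with gcs.open(filename, 'w', content_type='text/plain') as f:
            f.write(existing + new_rows)

        # Re-queue ourselves, passing only the filename and cursor --
        # both are small, unlike the file stream itself.
        if more:
            deferred.defer(append_batch, filename, next_cursor)
    ```

    Your cron handler then only needs to call `deferred.defer(append_batch, BUCKET_FILE)` once to start the chain.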

  • If your files are large, you might be better off storing smaller files and coming up with a suitable naming scheme, e.g., adding an index to the filename, as sketched below.
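
    For example, each deferred run could write its batch to a fresh object, so no read-back is needed at all. This is a minimal sketch reusing the placeholder `MyModel` from above; the `SHARD_PATTERN` naming scheme is purely illustrative:

    ```python
    import cloudstorage as gcs
    from google.appengine.ext import deferred

    SHARD_PATTERN = '/my-bucket/output-%05d.txt'  # hypothetical naming scheme

    def write_shard(index, cursor=None):
        entities, next_cursor, more = MyModel.query().fetch_page(
            100, start_cursor=cursor)

        # Each run writes its own object, named by index, so shards can
        # later be listed in order and stitched together if needed.
        with gcs.open(SHARD_PATTERN % index, 'w',
                      content_type='text/plain') as f:
            for e in entities:
                f.write('%s\n' % e.some_field)

        if more:
            deferred.defer(write_shard, index + 1, next_cursor)
    ```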