
How to write to Google Cloud Storage using task chaining? The blobstore Files API did a nice job


Using the blobstore Files API, I could write very large blob files:

  • create a blobfile
  • read data from the datastore and write (append) to a blobstore file
  • pass the datastore page cursor and the blobstore file to the (next) task
  • ....chain as many tasks as needed for the same purpose
  • and finalize the blobstore file

Now, with the GAE GCS client, I cannot append and finalize. How can I write very large files to GCS without compose? Compose is not part of the GCS client library. The Files API still works, but it has been deprecated.
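
For context, this is roughly what a write looks like with the Python GCS client (module cloudstorage from the GoogleAppEngineCloudStorageClient package): the object is finalized as soon as the write handle is closed, and there is no append mode that would let a later task reopen it. The file name and row source below are placeholders:

import cloudstorage as gcs

def write_file(filename, rows):
    # filename has the form '/bucket/object'; leaving the 'with' block
    # closes the handle and finalizes the object, so all writing has to
    # happen within a single request
    with gcs.open(filename, 'w', content_type='text/plain') as f:
        for row in rows:
            f.write(row)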

Below is the blobstore solution using task chaining:

import webapp2
from google.appengine.api import files, taskqueue


class BlobData(webapp2.RequestHandler):

    def post(self):

        page = int(self.request.get('page', default_value='0'))
        data = Data.get_data(.....)

        blob_file = self.request.get('blobfile', default_value='none')
        if blob_file == 'none':
            # first task in the chain: create the blobstore file
            blob_file = files.blobstore.create(mime_type='text/...',
                                               _blobinfo_uploaded_filename='data....txt')
        else:
            # follow-up task: resume the query where the previous task stopped
            data.with_cursor(self.request.get('cursor'))

        count = 0  # lines written by this task
        with files.open(blob_file, 'a') as f:  # append to the (unfinalized) file
            for each in data.fetch(page):
                f.write(each)
                count += 1

        if count >= page:  # a full page was written, so there may be more data
            cursor = data.cursor()
            # chain the next task, handing it the query cursor and the file name
            taskqueue.add(url='/blobdata', queue_name='blobdata', countdown=10, method='POST',
                          params={'page': page, 'cursor': cursor, 'blobfile': blob_file},
                          headers={'X-AppEngine-FailFast': 'True'})
        else:  # no data left
            files.finalize(blob_file)

Solution

  • In the Java client, we can serialize the channel (the equivalent of a buffer in the Python client) and pass it to another task to continue working on the same file. See the Java doc for more info:

    A readable byte channel for reading data from Google Cloud Storage. Implementations of this class may buffer data internally to reduce remote calls.

    This class is Serializable, which allows for reading part of a file, serializing the GcsInputChannel, deserializing it, and continuing to read from the same file from the same position.

    I do not know whether the buffers returned by the Python GCS client are serializable; I did not find any info in the doc, but it might be worth checking.
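
    If the buffer does turn out to be picklable, the task-chaining pattern from the question could look roughly like this. This is an untested sketch: it assumes the write buffer returned by gcs.open survives a pickle round trip, and that the pickled state is small enough to pass along (e.g. in the task payload or the datastore):

    import pickle
    import cloudstorage as gcs

    # first task: open the file and write the first page
    f = gcs.open('/mybucket/data.txt', 'w', content_type='text/plain')
    f.write(first_page)  # first_page / next_page stand in for the datastore rows
    state = pickle.dumps(f)  # assumption: the buffer supports pickling

    # ...pass 'state' on to the next task...

    # next task: restore the buffer and continue writing
    f = pickle.loads(state)
    f.write(next_page)
    f.close()  # closing the handle finalizes the object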

    If that's not possible, then use composition. Do not worry about the fact that composition is not available in the GCS client: you can always use the standard Cloud Storage API library from App Engine. The API documentation has a compose example in Python here. It looks like this:

    # 'client' below is an authorized Cloud Storage JSON API client,
    # built e.g. with apiclient.discovery.build('storage', 'v1', http=http)
    import json

    composite_object_resource = {
            'contentType': 'text/plain',  # required
            'contentLanguage': 'en',
            'metadata': {'my-key': 'my-value'},
    }
    compose_req_body = {
            'sourceObjects': [
                    {'name': source_object_name_1,
                     'objectPreconditions': {'ifGenerationMatch': source_generation_1}},
                    {'name': source_object_name_2,
                     'objectPreconditions': {'ifGenerationMatch': source_generation_2}}],
            'destination': composite_object_resource
    }
    req = client.objects().compose(
            destinationBucket=bucket_name,
            destinationObject=composite_object_name,
            body=compose_req_body)
    resp = req.execute()
    print json.dumps(resp, indent=2)
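
    To combine compose with task chaining, each task can write its page to its own chunk object using the GCS client, and the last task merges the chunks into the final file. A sketch under stated assumptions (the helper names and chunk naming scheme are made up; 'client' is the same authorized JSON API client as above; a single compose request accepts at most 32 source objects, hence the batching):

    import cloudstorage as gcs

    def write_chunk(bucket, index, rows):
        # each chained task writes one self-contained, finalized chunk
        object_name = 'data_chunk_%05d' % index
        with gcs.open('/%s/%s' % (bucket, object_name), 'w',
                      content_type='text/plain') as f:
            for row in rows:
                f.write(row)
        return object_name

    def compose_chunks(client, bucket, chunk_names, destination):
        # a compose request takes at most 32 sources; after the first
        # round the partial result itself counts as one of the sources
        remaining = list(chunk_names)
        first = True
        while remaining:
            take = 32 if first else 31
            batch, remaining = remaining[:take], remaining[take:]
            names = batch if first else [destination] + batch
            body = {
                'sourceObjects': [{'name': n} for n in names],
                'destination': {'contentType': 'text/plain'},
            }
            client.objects().compose(destinationBucket=bucket,
                                     destinationObject=destination,
                                     body=body).execute()
            first = False

    Once the compose calls succeed, the chunk objects can be deleted.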