Tags: kubernetes, google-cloud-platform, google-cloud-storage, elixir, data-export

How to zip objects in object storage


How would you go about organizing a process for zipping objects that reside in object storage?

For context, our users sometimes request an export of all their data from the app - think of the "Downloading Twitter archive" feature of Twitter.

Our users are able to upload files, so the exported data must include files stored in object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.

A naive approach would look like this:

  1. download all files from object storage to disk,
  2. zip all files into an archive,
  3. upload the .zip back to object storage,
  4. send the user a link to download the .zip file.
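
For concreteness, here is a minimal sketch of those four steps. It is written in Python rather than Elixir purely for illustration. `ObjectStore`, `LocalDirStore`, and `export_user_files` are hypothetical names, and the local-directory store is only a stand-in for a real Google Cloud Storage client.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; a real one would wrap a GCS client."""

    def download(self, key: str, dest: Path) -> None: ...
    def upload(self, src: Path, key: str) -> None: ...


class LocalDirStore:
    """Stand-in store backed by a local directory, for illustration only."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def download(self, key: str, dest: Path) -> None:
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes((self.root / key).read_bytes())

    def upload(self, src: Path, key: str) -> None:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(src.read_bytes())


def export_user_files(store: ObjectStore, keys: Iterable[str], archive_key: str) -> str:
    """Steps 1-3: download each object, add it to a zip, upload the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        archive_path = workdir / "export.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for key in keys:
                local = workdir / "files" / key
                store.download(key, local)         # step 1
                archive.write(local, arcname=key)  # step 2
                local.unlink()                     # drop the local copy once archived
        store.upload(archive_path, archive_key)    # step 3
    # Step 4 is left to the caller: turn this key into a download link.
    return archive_key
```

Even in this form the whole archive sits on local disk before upload, and nothing is checkpointed between files.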

However, there are multiple disadvantages here:

  1. the files for even a single user sometimes add up to gigabytes,
  2. if the zipping process is interrupted, it has to start over.

What's a reasonable way to design a process that generates a .zip archive of user files that originally reside in object storage?


Solution

  • Unfortunately, your naive approach is the only way, because Cloud Storage offers no compute capabilities. Archiving files requires compute, memory, and temporary storage.

    The key is to choose a service, such as Compute Engine, that can meet your file processing requirements: multi-gigabyte files, fast processing (compression), and high-speed networking.

    Another issue is the time it takes to download, zip, and upload, which calls for an asynchronous, event-based design: start the file processing, then notify the user (email, message, web inbox, etc.) once it is complete.

    You could make the process synchronous and display a progress bar, but that will complicate the design.
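
As a rough illustration of the asynchronous design described above, the sketch below queues export requests and notifies the user when each one finishes. It is written in Python for illustration only. `ExportWorker`, `build_archive`, and `notify` are hypothetical names, and the in-process thread and queue are assumptions standing in for whatever job runner and message queue you actually deploy.

```python
import queue
import threading
from typing import Callable


class ExportWorker:
    """Runs export jobs on a background thread and notifies on completion."""

    def __init__(
        self,
        build_archive: Callable[[str], str],  # user_id -> download link (slow)
        notify: Callable[[str, str], None],   # (user_id, message), e.g. send email
    ) -> None:
        self._build_archive = build_archive
        self._notify = notify
        self._jobs = queue.Queue()  # holds user ids; None is the stop sentinel
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def request_export(self, user_id: str) -> None:
        """Called from the request handler; returns immediately."""
        self._jobs.put(user_id)

    def _run(self) -> None:
        while True:
            user_id = self._jobs.get()
            if user_id is None:
                break
            try:
                link = self._build_archive(user_id)
                self._notify(user_id, f"Your export is ready: {link}")
            except Exception as exc:
                # Report failures too, so the user is not left waiting.
                self._notify(user_id, f"Your export failed: {exc}")

    def close(self) -> None:
        """Process any queued jobs, then stop the worker thread."""
        self._jobs.put(None)
        self._thread.join()
```

The request handler only calls `request_export` and returns. The slow download, zip, and upload work happens in `build_archive`, off the request path.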