ruby-on-rails pdf zip archive

Archiving a large number of PDF files in a ZIP


I have a Ruby on Rails 5.1 application where I am generating PDF files that represent records in a database.

I need to archive these PDF files so they can be stored outside the application.

This is mostly a one-time event, so I don't need continual syncing.

I have working code that converts each record to a PDF file, adds that file to a ZIP file built in memory, and then returns that ZIP to the user as a download.

This works, but with a large number of records the web server times out, so I need to figure out a better approach that doesn't hog all the server memory.

The ZIP file could potentially be 200MB in size, with 10,000+ PDF files inside.

I host the applications on their own containers, so I can access the server file directory if necessary, but each re-deploy or container shutdown would wipe it.

The approach I'm thinking about implementing is:

  1. Run the archive in a background processor that emails the user a download link when it finishes.
  2. Break the records up into a separate ZIP for every 100 records or so (to avoid memory issues and individual files that are too massive).
  3. Store the ZIP files in the container's directory for 24 hours and let users download the archives via their email link (they would have a separate link for each ZIP file).
  4. Wipe the tmp ZIP files on the container after 24 hours.
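
Steps 2 and 4 could be sketched roughly as below. This is only an illustration using Ruby's standard library: the helper names `plan_archive_parts` and `purge_expired_archives`, the `archive-part-NNN.zip` naming scheme, and the batch size of 100 are assumptions, and the actual PDF and ZIP generation is left out.

```ruby
require "fileutils"

BATCH_SIZE = 100                 # assumed batch size from step 2
MAX_AGE_SECONDS = 24 * 60 * 60   # 24 hours, from steps 3 and 4

# Split record IDs into one planned ZIP per batch.
# Returns an array of [zip_filename, ids_in_that_zip] pairs.
def plan_archive_parts(record_ids, batch_size = BATCH_SIZE)
  record_ids.each_slice(batch_size).each_with_index.map do |ids, index|
    [format("archive-part-%03d.zip", index + 1), ids]
  end
end

# Delete ZIP files in `dir` whose modification time is older than `max_age`.
# Returns the paths that were removed.
def purge_expired_archives(dir, now: Time.now, max_age: MAX_AGE_SECONDS)
  expired = Dir.glob(File.join(dir, "*.zip")).select do |path|
    now - File.mtime(path) > max_age
  end
  FileUtils.rm_f(expired)
  expired
end
```

The background job would walk the planned parts and write one ZIP per pair, while a scheduled task would call the purge helper on the tmp directory.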

This is the first time I've done something large-scale like this; is this approach reasonable? What would be a better way to accomplish the goal of archiving PDF files off the server?


Solution

  • Your approach is reasonable. Some remarks:

    1. With 100 records per file and an expected 10,000 records per query, the user would have to download and handle 100 files manually, which is not very user-friendly. I'd look into producing one large file instead, either by building it on disk rather than in memory or by streaming it. Once all file sizes are known, you can use nginx mod_zip to create a non-compressed ZIP file on the fly (this may be useful if the same records can appear in multiple different exports).
    2. Depending on how long the operation takes, it may be desirable to show some kind of progress while the archive is being created, so that the user does not launch several more exports thinking the first one did not work.
    3. The export should survive an app deploy or restart, and it should also be idempotent.
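
To illustrate the mod_zip idea from remark 1: as far as I understand the module's README, the app responds with an `X-Archive-Files: zip` header and a plain-text body with one line per file in the form `<crc32> <size> <location> <name>`, and nginx then assembles the ZIP from those locations. The sketch below builds such a body with the standard library only. The `PdfEntry` struct, the helper names, and the `/exports/...` locations are hypothetical, and it assumes the CRC-32 and size are recorded when each PDF is generated and that locations are already URL-encoded; check the format against the mod_zip version you deploy.

```ruby
require "zlib"

# Metadata stored per generated PDF (hypothetical structure).
# name:      path of the file inside the ZIP
# location:  internal nginx location that serves the stored PDF
# crc32:     CRC-32 of the PDF bytes as 8 lowercase hex digits
# byte_size: size of the PDF in bytes
PdfEntry = Struct.new(:name, :location, :crc32, :byte_size)

# Record checksum and size once, at the moment the PDF is generated,
# so the PDF bytes do not have to be loaded again at download time.
def describe_pdf(name, location, pdf_bytes)
  PdfEntry.new(
    name,
    location,
    format("%08x", Zlib.crc32(pdf_bytes)),
    pdf_bytes.bytesize
  )
end

# Build the response body in the assumed mod_zip format:
# one "<crc32> <size> <location> <name>" line per file.
def mod_zip_manifest(entries)
  entries.map do |entry|
    "#{entry.crc32} #{entry.byte_size} #{entry.location} #{entry.name}\r\n"
  end.join
end
```

In a Rails controller the manifest would be rendered as plain text with the `X-Archive-Files` header set to `zip`, so the app process only ever sends a few hundred kilobytes of text even for 10,000+ files.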