node.js mongodb amazon-web-services amazon-s3 cloud-storage

Handling file uploads and storage in a node.js app using AWS S3

I am busy with a ToDo-like app where I want users to be able to add attachments to tasks.

I am struggling with the architecture of my app more than with the code.

For my frontend I am using Vue.js, with Node.js as the backend and MongoDB as the database, which I'm considering hosting on Heroku. I was thinking of using AWS S3 to store the attachments for my tasks.

I am unsure whether I should upload files to S3 through my Node server or use pre-signed URLs. I am also unsure of the best way to download the attachments from S3; I was thinking pre-signed URLs would be the best way to do this.

My main confusion is how to keep an index of all attachments of a task. Would storing an index in MongoDB that is related to my Task model be the best way to do this? Also, what conventions are there as to what metadata should be stored?

Lastly, I was wondering if there are any conventions as to how to organize the files uploaded to S3. Is it fine to just save the file under the Task's database ID? Should I change the file's name at all?

Solution

Store your attachments in S3. I would recommend you keep a separate bucket in S3 for attachments and keep track of those files in a MongoDB collection called Attachments.

For each file you keep the following document:

{
  "source_name" : "helloworld.txt",
  "s3_url" : "https://bucket-name.s3-eu-west-1.amazonaws.com/A591A6D40BF420404A011733CFB7B190D62C65BF0BCDA32B57B277D9AD9F146E",
  "sha256" : "A591A6D40BF420404A011733CFB7B190D62C65BF0BCDA32B57B277D9AD9F146E",
  "uploaded" : "Mon May 11 2020 13:40:28", // always UTC time
  "size" : 12
}

The source_name is the name of the file as it was uploaded. The s3_url is the file's location in S3; this should be a non-public bucket. The sha256 field is the SHA-256 checksum of the file, which you generate yourself. The checksum already appears as the object key in s3_url, but you also store it as a separate field. Finally, uploaded is the upload date and size is the file size in bytes.
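As a rough illustration of how such a document could be produced, here is a minimal sketch using Node's built-in crypto module. The function name buildAttachmentDoc and the bucketUrl parameter are hypothetical, and the sketch assumes the file contents are already in memory as a Buffer; it does not upload anything to S3.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: build the metadata document for one uploaded file.
// `contents` must be a Buffer; `bucketUrl` is a placeholder for your bucket's base URL.
function buildAttachmentDoc(sourceName, contents, bucketUrl) {
  // Hex-encoded SHA-256 of the file contents, uppercased to match the example document.
  const sha256 = crypto
    .createHash('sha256')
    .update(contents)
    .digest('hex')
    .toUpperCase();

  return {
    source_name: sourceName,
    s3_url: `${bucketUrl}/${sha256}`, // the checksum doubles as the object key
    sha256,
    uploaded: new Date(),             // a Date is an absolute instant, so no local-time ambiguity
    size: contents.length,            // Buffer length is the size in bytes
  };
}
```

Using the checksum as the object key means the stored name never depends on what the user called the file, which remains available in source_name.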

Why go to the overhead of a checksum? It is more secure, you automatically dedupe your files, and you can easily detect uploaded files that you already have in your collection.

This means you can find files by checksum and name quickly, and you can add other discriminator fields in the future.
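The dedupe check could look something like the sketch below. It assumes attachments is a collection object exposing findOne() and insertOne() as in the MongoDB Node.js driver; the function name findOrCreateAttachment is hypothetical, and matching on the checksum alone (ignoring source_name) is a simplifying assumption.

```javascript
// Hypothetical helper: reuse an existing attachment document when the same
// content (same SHA-256) has been stored before, otherwise insert a new one.
// `attachments` is assumed to behave like a MongoDB driver Collection.
async function findOrCreateAttachment(attachments, doc) {
  const existing = await attachments.findOne({ sha256: doc.sha256 });
  if (existing) {
    // Same bytes already stored: the caller can skip the S3 upload.
    return { attachment: existing, created: false };
  }
  const result = await attachments.insertOne(doc);
  return { attachment: { ...doc, _id: result.insertedId }, created: true };
}
```

Two simultaneous uploads of the same file could both pass the findOne() check, so a unique index on sha256 is probably worth adding as a backstop.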

Upload and download should be managed by your application. You store the _id field of this document in your Task documents so attachments can be retrieved quickly.
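One way to record that reference is sketched below, assuming tasks behaves like a MongoDB Node.js driver Collection. The function name linkAttachment and the attachments array field on the Task document are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: store an attachment's _id on a Task document.
// `tasks` is assumed to behave like a MongoDB driver Collection.
async function linkAttachment(tasks, taskId, attachmentId) {
  // $addToSet appends the id only if it is not already in the array,
  // so linking the same attachment twice leaves a single entry.
  const result = await tasks.updateOne(
    { _id: taskId },
    { $addToSet: { attachments: attachmentId } }
  );
  return result.matchedCount === 1; // false when no task has this _id
}
```

With the ids stored on the task, loading a task's attachments becomes a single lookup by _id rather than a search across the whole Attachments collection.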

A final optimisation is to embed this document in your Task document and save the complexity and overhead of an additional collection. Do this if the ratio of attachments to Tasks is low.