
S3 vs EFS propagation delay for distributed file system?


I'm working on a project that uses multiple Docker containers which all need access to the same files for comparison purposes. What's important is that once a file becomes visible to one container, it becomes visible to the other containers with minimal delay.

As an example, here's the situation I'm trying to avoid: Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the file system and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after, file A appears visible to container 1 and file B appears visible to container 2. Due to the way the files propagated through the distributed file system, file B is not yet visible to container 1 and file A is not yet visible to container 2. Container 1 is now told to compare file A against all other files, and container 2 is told to compare file B against all other files. Because of the propagation delay, A and B are never compared to each other.

I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or whether there's a third option I'm unaware of).

The characteristics of the files/containers are:

  • All files are small text files averaging 2 KB in size (although rarely they can reach 10 KB)
  • There's currently about 20 MB of files in total, but I expect that to grow to 1 GB by the end of the year
  • These containers are not in a swarm
  • The output of each comparison is already being uploaded to S3
  • Making sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor

(One last note: if I end up using S3, I would probably be using `aws s3 sync` to pull down all new files put into the bucket)
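For reference, a minimal sketch of what that pull step might look like from inside a container; the bucket name and local directory are placeholders, not the ones I actually use:

```python
import subprocess

# Hypothetical bucket and local path -- substitute your own.
BUCKET = "s3://my-comparison-files"
LOCAL_DIR = "/data/files"

def pull_new_files():
    """Pull down any objects not already present locally.
    `aws s3 sync` only copies objects that are new or changed."""
    subprocess.run(["aws", "s3", "sync", BUCKET, LOCAL_DIR], check=True)

if __name__ == "__main__":
    pull_new_files()
```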

Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file compared to every other file at least once. I can't say exactly what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system holds all the files I want to compare against). They need to be in containers for two reasons:

  1. The binary relies heavily on a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again, the binary is closed source and there's no way around it)
  2. The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines.

Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.


Solution

  • In the end, I decided that my original approach was too complicated. Instead, I ended up using S3 to store all the files, with DynamoDB acting as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory, then check DynamoDB to see if any files are missing. Thanks to S3's read-after-write consistency, any missing files can be pulled from S3 directly without waiting for them to propagate to all of the S3 caches. This gives what is effectively an instantly propagating distributed file system.
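To illustrate the flow, here is a rough boto3 sketch under the assumptions above; the bucket, table, key attribute, and local path are hypothetical names, not the exact ones from my setup:

```python
import os
import boto3

# Hypothetical names -- substitute your own bucket/table/paths.
BUCKET = "comparison-files"
TABLE = "recent-file-keys"
LOCAL_DIR = "/data/files"

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)


def upload_file(path, key):
    """Upload a file to S3, then record its key in DynamoDB.
    The key is written only after the upload succeeds."""
    s3.upload_file(path, BUCKET, key)
    table.put_item(Item={"key": key})


def fetch_missing_files():
    """After the normal sync, pull any files DynamoDB knows about
    that haven't shown up locally yet."""
    response = table.scan()  # fine for a small table; paginate if it grows
    for item in response["Items"]:
        key = item["key"]
        local_path = os.path.join(LOCAL_DIR, key)
        if not os.path.exists(local_path):
            # A direct GET by key returns the object even if it
            # hasn't propagated to every listing yet.
            os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
            s3.download_file(BUCKET, key, local_path)
```

The key design point is that DynamoDB is only a list of keys, not a copy of the data: it tells each container what should exist, and S3 remains the single source of truth for the file contents.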