
Using S3 versioning for maintaining multiple artifacts


Currently we are using S3 buckets as a repository for our artifacts. These artifacts are just jars and zips for different Spark jobs. Let's assume the base directory is s3://our-awesome-jobs/dev. When code changes are pushed to master, artifact names are suffixed with short commit IDs and the files are pushed to S3. There's one `latest` file inside each job's folder which always contains the name of the latest artifact. E.g. for a job called job1, the S3 folder structure will look something like the following:

s3://our-awesome-jobs/dev/job1/artifacts
|
+-- java_job1_023f2d9.jar   # pushed on 10th July
|
+-- java_job1_162ea58.jar   # pushed on 5th July
|
+-- java_job1_81a4cc2.jar   # pushed on 1st July
|
+-- latest                  # contains the entry `java_job1_023f2d9.jar`
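
For reference, the push step in our CI job looks roughly like the following (a minimal boto3 sketch; `push_artifact` and its arguments are placeholders for whatever the CI pipeline actually provides):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "our-awesome-jobs"
    PREFIX = "dev/job1/artifacts"

    def push_artifact(jar_path, commit_id):
        """Upload a commit-suffixed jar and repoint `latest` at it."""
        artifact_name = f"java_job1_{commit_id}.jar"
        # Store the new artifact alongside the older ones.
        s3.upload_file(jar_path, BUCKET, f"{PREFIX}/{artifact_name}")
        # Overwrite the `latest` pointer file with the new artifact's name.
        s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/latest",
                      Body=artifact_name.encode())

    push_artifact("target/java_job1.jar", "023f2d9")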

I was wondering if we can use the S3 versioning mechanism to streamline the storage of artifacts inside the bucket. As per my understanding, for a newer version of a file to replace an older one, both of them have to have the same name (key). In that case the commit-id information has to be maintained differently. Is there an industry standard for achieving the functionality that I want? Any thoughts or comments are appreciated.
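
For example, something along these lines is what I had in mind (a sketch, assuming versioning has been enabled on the bucket; keeping the commit id as user-defined object metadata is just one idea):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "our-awesome-jobs"
    KEY = "dev/job1/artifacts/java_job1.jar"  # fixed key; S3 keeps every version

    # One-time setup: turn on versioning for the bucket.
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Every push overwrites the same key; S3 assigns a new version id,
    # and the commit id travels along as metadata.
    with open("target/java_job1.jar", "rb") as f:
        resp = s3.put_object(Bucket=BUCKET, Key=KEY, Body=f,
                             Metadata={"commit-id": "023f2d9"})
    print(resp["VersionId"])  # the S3-generated version id for this upload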


Solution

  • S3 versioning works best as one of:

    • backup/recovery
    • a way to snapshot a set of files which can then be retrieved later, knowing that subsequent overwrites don't matter
    • a way to read a file over multiple GET calls with the guarantee that, even if the file is overwritten mid-read, you still get a consistent view (S3A will do this in Hadoop 3.3)

    There's no (exposed) way to ask for an artifact by version ID in the s3a connector, nor, AFAIK, in the AWS one. The ASF Hadoop cloud connector team would be happy to take a contribution along the lines of ?version= and ?etag= arguments, so that you could add stricter references. Tests and docs will be expected, of course...
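
    For what it's worth, the underlying S3 API itself does let you pin a read to a specific version via the SDKs; what's missing is a way to express that through the Hadoop filesystem layer. A minimal boto3 sketch (bucket and key names taken from the question) of the calls a connector would need to pass through:

        import boto3

        s3 = boto3.client("s3")
        BUCKET = "our-awesome-jobs"
        KEY = "dev/job1/artifacts/java_job1.jar"

        # List every stored version of the key, newest first.
        listing = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
        for v in listing.get("Versions", []):
            print(v["VersionId"], v["LastModified"],
                  "latest" if v["IsLatest"] else "")

        # Fetch one specific version, immune to later overwrites.
        obj = s3.get_object(Bucket=BUCKET, Key=KEY,
                            VersionId=listing["Versions"][0]["VersionId"])
        data = obj["Body"].read()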