I am wondering what the best way is to achieve de-duplicated (single-instance) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
To upload subsequent files:
To read a file:
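As one possible reading of the upload and read flow above: hash the file's content, use the hash as the storage key, store the bytes only if that key is not already present, and keep a separate index that maps file names to hashes. The sketch below is only an illustration under those assumptions. The `DedupStore` class, its two dicts (stand-ins for an S3 bucket and an index database), and the choice of SHA-256 are all hypothetical and are not taken from S3 or any existing library.

```python
import hashlib


class DedupStore:
    """Toy single-instance store. Plain dicts stand in for S3 and the index."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes (assumed stand-in for S3 objects)
        self.index = {}  # virtual file path -> content hash (assumed index table)

    def upload(self, path, data):
        # Identical content always yields the same key.
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only the first time this content is seen.
        if digest not in self.blobs:
            self.blobs[digest] = data
        # Every path gets an index entry, even when the bytes are shared.
        self.index[path] = digest
        return digest

    def read(self, path):
        # Resolve the path to its hash, then fetch the single stored copy.
        return self.blobs[self.index[path]]


store = DedupStore()
for name in ("a.txt", "b.txt", "c.txt"):
    store.upload(name, b"same content")
```

After the three uploads the index holds three entries, but only one blob is stored. Against real S3, the membership check and the write would become requests to the bucket, and the index would need to live somewhere durable.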
You could also make this technique more efficient by uploading files in fixed-size blocks and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the ordered chain of blocks for that file. That would also have the advantage that uploading a large file which is only slightly different from a previously uploaded file would involve a lot less storage and data transfer.
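A rough sketch of the block-level variant, under the same assumptions as before: dicts stand in for S3 and the index, SHA-256 is an arbitrary choice, and the `BlockDedupStore` name and the tiny block size are purely illustrative.

```python
import hashlib


class BlockDedupStore:
    """Toy block-level de-duplicating store. Dicts stand in for S3 and the index."""

    def __init__(self, block_size=4):
        self.block_size = block_size  # unrealistically small, for illustration
        self.blocks = {}  # block hash -> block bytes
        self.files = {}   # virtual file path -> ordered list of block hashes

    def upload(self, path, data):
        hashes = []
        new_blocks = 0
        # Split the data into fixed-size blocks; the last one may be shorter.
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only blocks that have not been seen before are stored.
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            hashes.append(digest)
        self.files[path] = hashes
        return new_blocks  # number of blocks that actually had to be stored

    def read(self, path):
        # Reassemble the file by concatenating its blocks in order.
        return b"".join(self.blocks[h] for h in self.files[path])


store = BlockDedupStore(block_size=4)
first = store.upload("v1.bin", b"aaaabbbbcccc")
second = store.upload("v2.bin", b"aaaabbbbdddd")
```

Here the second upload shares its first two blocks with the first file, so only one new block is stored. Note that fixed-size blocks only match when the changed bytes do not shift the block boundaries: inserting a byte near the start of a file changes every block after it.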