Search code examples
amazon-web-servicesamazon-s3amazon-cloudfront

AWS S3 Standard Infrequent Access vs Reduced Redundancy storage class when coupled with CloudFront?


I'm using CloudFront to cache and distribute all of my thumbnails currently stored on S3 in Standard storage class. Since CloudFront caches originals and accesses them only every 24 hours, it makes sense to use a cheaper storage class than Standard: either Standard Infrequent Access (IA) or Reduced Redundancy (RR). But I'm not sure which one would be more suitable and cost effective.

Standard-IA has the cheapest storage among all (58% cheaper than Standard class and 47% cheaper than RR), but 60% more expensive requests than both Standard and RR. However, all files under 128kb stored in Standard-IA class are rounded to 128kb when calculating cost, which would apply to most of my thumbnail images.

Meanwhile, storage in RR class is only 20% cheaper than Standard, but its request cost is 60% cheaper than that of Standard-IA.

I'm unsure which one would be most cost effective in practice and would appreciate anyone with experience using both to give some feedback.


Solution

  • There's a problem with the premise of your question. The fact that CloudFront may cache your objects for some period of time actually has little relevance when selecting an S3 storage class.

    REDUCED_REDUNDANCY is sometimes less expensive¹ because S3 stores your data on fewer physical devices, reducing the reliability somewhat in exchange for lower pricing... and in the event of failures, the object is statistically more likely to be lost by S3. If S3 loses the object because of the reduced redundancy, CloudFront will at some point begin returning errors.

    The deciding factor in choosing this storage class is whether the object is easily replaced.

    Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to reduce their costs by storing noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced.

    https://aws.amazon.com/s3/reduced-redundancy/

    STANDARD_IA (infrequent access) is less expensive for a different reason: the storage savings are offset by retrieval charges. If an object is downloaded more than once per month, the combined charge will exceed the cost of STANDARD. It is intended for objects that will genuinely be accessed infrequently. Since CloudFront has multiple edge locations, each with its own independent cache,² whether an object is "currently stored in" CloudFront is not a question with a simple yes/no answer. It is also not possible to "game the system" by specifying large Cache-Control: max-age values. CloudFront has no charge for its cache storage, so it's only sensible that an object can be purged from the cache before the expiration time you specify. Indeed, anecdotal observations confirm what the docs indicate, that objects are sometimes purged from CloudFront due to a relative lack of "popularity."

    The deciding factor in choosing this storage class is whether the increased data transfer (retrieval) charges will be low enough to justify the storage charge savings that they offset. Unless the object is expected to be downloaded less than once or twice a month, this storage class does not represent a cost savings.

    Standard/Infrequent Access should be reserved for things you really don't expect to be needed often, like tarballs and database dumps and images unlikely to be reviewed after they are first accessed, such as (borrowing an example from my world) a proof-of-purchase/receipt scanned and submitted by a customer for a rebate claim. Once the rebate has been approved, it's very unlikely we'll need to look at that receipt again, but we do need to keep it on file. Hello, Standard_IA. (Note that S3 does this automatically for me, after the file has been stored for 30 days, using a lifecycle policy on the bucket).

    Standard - IA is ideally suited for long-term file storage, older data from sync and share, backup data, and disaster recovery files.

    https://aws.amazon.com/s3/faqs/#sia

    Side note: one alternative mechanism for saving some storage cost is to gzip -9 the content before storing, and set Content-Encoding: gzip. I have been doing this for years with S3 and am still waiting for my first support ticket to come in reporting a browser that can't handle it. Even content that is allegedly already compressed -- such as .xlsx spreadsheets -- will often shrink a little bit, and every byte you squeeze out means slightly lower storage and download bandwidth charges.

    Fundamentally, if your content is easily replaceable, such as resized images where you still have the original... or reports that can easily be rerun from source data... or content backed up elsewhere (AWS is essentially always my first choice for cloud services, but I do have backups of my S3 assets stored in another cloud provider's storage service, for example)... then reduced redundancy is a good option.


    ¹ REDUCED_REDUNDANCY is sometimes less expensive only in some regions as of late 2016. Prior to that, it was priced lower than STANDARD, but in an odd quirk of the strange world of competitive pricing, as a result of S3 price reductions announced in November, 2016, in some AWS regions, the STANDARD storage class is now slightly less expensive than REDUCED_REDUNDANCY ("RRS"). For example, in us-east-1, Standard was reduced from $0.03/GB to $0.023/GB, but RRS remained at $0.024/GB... leaving no obvious reason to ever use RRS in that region. The structure of the pricing pages leaves the impression that RRS may no longer be considered a current-generation offering by AWS. Indeed, it's an older offering than both STANDARD_IA and GLACIER. It is unlikely to ever be fully deprecated or eliminated, but they may not be inclined to reduce its costs to a point that lines up with the other storage classes if it's no longer among their primary offerings.

    ² "CloudFront has multiple edge locations, each with its own independent cache" is still a technically true statement, but CloudFront quietly began to roll out and then announced some significant architectural changes in late 2016, with the introduction of the regional edge caches. It is now, in a sense, "less true" that the global edge caches are indepenent. They still are, but it makes less of a difference, since CloudFront is now a two-tier network, with the global (outer tier) edge nodes sometimes fetching content from the regional (inner tier) edge nodes, instead of directly from the origin server. This should have the impact of increasing the likelihood of an object being considered to be "in" the cache, since a cache miss in the outer tier might be transformed into a hit by the inner tier, which is also reported to have more available cache storage space than some or all of the outer tier. It is not yet clear from external observations how much of an impact this has on hit rates on S3 origins, as the documentation indicates the regional edges are not used for S3 (only custom origins) but it seems less than clear that this universally holds true, particularly with the introduction of Lambda@Edge. It might be significant, but as of this writing, I do not believe it to have any material impact on my answer to the question presented here.