amazon-web-services, aws-lambda, devops, amazon-ecs, aws-billing

AWS billing breakdown to system components and artifacts


We have been running a multi-tier application on AWS, using various AWS services like ECS, Lambda, and RDS. We are looking for a solution that maps billing items to actual system components, identifies the components that cost the most, etc.

AWS has improved its detailed Cost and Usage Reports and offers a Cost Explorer API, but these only break the bill down by service or instance. A per-instance breakdown does not add much value when you are looking for the cost of each component. Any solutions/recommendations for this?


Solution

  • Cost Allocation Tags

    You can create a tag such as "system" or "app", apply it to all of your resources, and set the value to the application/system/component that you wish to track. Then go to the billing page, click on "Cost Allocation Tags", and activate the tag you created.

    Then you can see costs broken down by the different values of that tag. They will show up in Cost Explorer, where the tag will be one of the available filters. However, I think it takes about 24 hours after activation before they show up.
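
    For example, here's a rough sketch of pulling per-component costs grouped by that tag with the Cost Explorer API via boto3 (the tag key "system" and the dates are placeholders, and the tag must already be activated as a cost allocation tag):

    ```python
    import boto3

    ce = boto3.client("ce")

    # Monthly unblended cost, grouped by the values of the "system" cost allocation tag.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2019-01-01", "End": "2019-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "system"}],
    )

    for group in response["ResultsByTime"][0]["Groups"]:
        tag_value = group["Keys"][0]  # e.g. "system$frontend"
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(tag_value, amount)
    ```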

    If you do need to enforce tag usage, and you have developers who work on multiple components, it's possible to have an IAM role for managing each component, with each role limited to interacting with resources that carry a specific tag (i.e., it can only modify existing resources with that tag, and it can only create new resources with that tag). A developer can have an IAM user (or you could federate identities, but that's a whole different conversation) and assume different roles depending on which component they are working on. This has the added benefit of making cross-account management easier. However, it may require a non-trivial IAM overhaul.
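
    Here's a rough sketch of what such a tag-scoped policy might look like, limited to a few EC2 actions (the tag key "system" and value "checkout" are placeholders, and a real policy would need more statements, e.g. ec2:CreateTags for tagging on launch):

    ```python
    import json
    import boto3

    # Limit a role to EC2 instances carrying a specific "system" tag, and
    # only allow launching instances that request that same tag value.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageOnlyTaggedResources",
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances"],
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:ResourceTag/system": "checkout"}},
            },
            {
                "Sid": "RequireTagOnCreate",
                "Effect": "Allow",
                "Action": "ec2:RunInstances",
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:RequestTag/system": "checkout"}},
            },
        ],
    }

    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="checkout-component-access",  # placeholder name
        PolicyDocument=json.dumps(policy),
    )
    ```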

    More info on cost allocation tags here: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html

    Divide Cost Boundaries by AWS Account

    To handle the components that are not taggable, such as data transfer, you could build your account strategy around cost boundaries and have a separate account for each cost silo (if that's tenable). That may increase cost, because you'd have to break systems into specific accounts (and therefore specific EC2 instances).
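
    Under consolidated billing, each member account then becomes its own cost line, so the breakdown falls straight out of the same Cost Explorer call as above, grouped by linked account instead of tag (dates are placeholders):

    ```python
    import boto3

    ce = boto3.client("ce")

    # Monthly unblended cost per linked (member) account in the organization.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2019-01-01", "End": "2019-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
    )

    for group in response["ResultsByTime"][0]["Groups"]:
        print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
    ```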

    When you centralize reporting, monitoring, config management, log analysis, etc., each application will add a little to that cost; usually you just have to consider that centralization a system in itself and cost it out separately. Obviously, you can run separate monitoring, alerting, reporting, log collection, config management, etc. for each system, but this will cost more overall (both in infrastructure costs and engineering hours). So you have to weigh cost visibility against cost optimization.

    AWS still offers plenty of capabilities for connecting resources across accounts, and it's not difficult to have a data layer in one account and an app tier in another (though it's not a paradigm I often see).

    Custom Tooling

    If the above are imperfect solutions for your environment, you could use them as far as they are feasible and write scripts to estimate usage of the things that are harder to track. For bandwidth, if you ran your own EC2 instances as forward proxies or NAT instances, you could write your own outbound data transfer accounting software. If everything in your VPCs had a route pointing to ENIs on those instances, you could track outbound transfer by any parameters you choose. This sounds a little fragile to me, and there may be cases where it isn't tenable from a network perspective, but it's a possibility.
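
    As a rough starting point, here's a sketch that totals the NetworkOut metric for one such proxy/NAT instance over the last day (the instance ID is a placeholder; NetworkOut counts all bytes leaving the instance, not just internet egress, so treat the result as an estimate):

    ```python
    from datetime import datetime, timedelta

    import boto3

    cw = boto3.client("cloudwatch")
    end = datetime.utcnow()
    start = end - timedelta(days=1)

    # Hourly sums of bytes sent out of the instance over the last 24 hours.
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )

    total_gb = sum(dp["Sum"] for dp in stats["Datapoints"]) / 1e9
    print(f"~{total_gb:.2f} GB out in the last 24 hours")
    ```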

    Similarly, with CloudWatch metrics you can use namespaces. I wasn't able to find any reference to filtering by CloudWatch namespace in Cost Explorer, but it would probably be easy enough to count the raw metrics per namespace and estimate costs from that, dividing your components in CloudWatch by namespace. This may lead to some duplication, which may mean more management effort or increased cost, but that's the tradeoff for more granular cost visibility.
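
    A rough sketch of that estimate (the namespace names and the per-metric price are placeholders; check current CloudWatch pricing for your region):

    ```python
    import boto3

    cw = boto3.client("cloudwatch")
    paginator = cw.get_paginator("list_metrics")

    def metric_count(namespace):
        # Count every metric currently reported under the namespace.
        return sum(len(page["Metrics"]) for page in paginator.paginate(Namespace=namespace))

    PRICE_PER_METRIC_MONTH = 0.30  # assumed USD per custom metric per month; verify against pricing

    for ns in ["MyApp/Frontend", "MyApp/Backend"]:
        n = metric_count(ns)
        print(f"{ns}: {n} metrics, roughly ${n * PRICE_PER_METRIC_MONTH:.2f}/month")
    ```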

    Kubernetes

    This may be very pie-in-the-sky for your environment, but it's worth mentioning. If you ran a cluster using EKS, or a self-managed cluster on EC2, you could harness that platform: provision a base level of compute resources, divide components into namespaces, and use built-in or third-party tools to grab usage statistics per namespace (or even per workload). This is much easier to enforce, because you can give developers access to specific namespaces, and outliers are generally more obvious. When you know how much CPU and memory each workload uses over time, you can get a pretty good estimate of individual cost patterns by component.
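
    As a rough sketch of the idea, here's a per-namespace total of CPU requests using the Kubernetes Python client (assuming kubeconfig access to the cluster; actual usage from the metrics server or a third-party tool would give a better picture than requests alone):

    ```python
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def cpu_millicores(quantity):
        # Convert "250m" or "2" style CPU quantities to millicores.
        return int(quantity[:-1]) if quantity.endswith("m") else int(float(quantity) * 1000)

    totals = {}
    for pod in v1.list_pod_for_all_namespaces().items:
        ns = pod.metadata.namespace
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            if "cpu" in requests:
                totals[ns] = totals.get(ns, 0) + cpu_millicores(requests["cpu"])

    for ns, millicores in sorted(totals.items()):
        print(f"{ns}: {millicores}m CPU requested")
    ```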

    Of course, you will still have a cost for the k8s management plane, which will be in a cost bucket apart from all of your other applications/systems.

    Istio, while not a simple technology by any means, lets you collect granular metrics about data egress, which you can use to get an idea of how much data transfer cost is being run up.

    It might be easier to duplicate monitoring in each namespace, since you already have to abstract your monitoring workload to a certain extent to run on k8s at all. That still increases management overhead and overall cost, but perhaps less than siloing at the infrastructure (AWS) layer.

    Summary

    There aren't many options I know of for getting the level of granularity and control that you need in AWS, and efforts to this end will probably increase overall cost and management overhead. AWS is rather notorious for its difficult-to-estimate cost model. Perhaps look into platforms other than AWS that might provide better visibility into component costs for your workloads.

    It's also difficult to avoid systems that operate centrally and whose per-system cost is hard to trace: log management, config management, authentication, alerting, monitoring, etc. Generally it's more cost-effective and more manageable to centralize these functions for all of your workloads, but then the TCO of individual apps becomes hard to pin down. In my experience, most teams write this off as infrastructure cost and track the cost of an app mostly through compute, storage, and AWS service usage.