Tags: linux, amazon-web-services, filesystems, amazon-cloudwatch

What is the AWS CloudWatch agent's disk_used_percent measuring? It does not match the usage I see with lsblk or df


I have a t4g.large EC2 instance running Ubuntu 22.04, with a single 30GB storage volume. I have installed and configured the CloudWatch agent to monitor disk usage.

Right now, the metrics in CloudWatch show that the disk is 56% full.

If I run lsblk -f, I see this (I deleted the uuid column for conciseness):

NAME         FSTYPE   FSVER LABEL           FSAVAIL FSUSE% MOUNTPOINTS  
loop0        squashfs 4.0                         0   100% /snap/core20/1699  
loop1        squashfs 4.0                         0   100% /snap/amazon-ssm-agent/5657  
loop2        squashfs 4.0                                   
loop3        squashfs 4.0                         0   100% /snap/lxd/23545  
loop4        squashfs 4.0                         0   100% /snap/core18/2658  
loop5        squashfs 4.0                         0   100% /snap/core18/2636  
loop6        squashfs 4.0                         0   100% /snap/snapd/17885  
loop7        squashfs 4.0                         0   100% /snap/amazon-ssm-agent/6313  
loop8        squashfs 4.0                         0   100% /snap/core20/1740  
nvme0n1                                                    
├─nvme0n1p1  ext4     1.0   cloudimg-rootfs    2.9G    90% / 
└─nvme0n1p15 vfat     FAT32 UEFI              92.4M     5% /boot/efi

If I run df -h, I see this:

Filesystem       Size  Used Avail Use% Mounted on
/dev/root         29G   27G  2.9G  91% /
tmpfs            3.9G     0  3.9G   0% /dev/shm
tmpfs            1.6G  1.1M  1.6G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p15   98M  5.1M   93M   6% /boot/efi
tmpfs            782M  8.0K  782M   1% /run/user/1000

I don't understand where 56% could be coming from. Even if the CloudWatch agent were summing over all of the mount points, it would come out to roughly 75%, not 56%.

This is my config for the agent:

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "root"
    },
    "metrics": {
        "aggregation_dimensions": [
            [
                "InstanceId"
            ]
        ],
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "collectd": {
                "metrics_aggregation_interval": 60
            },
            "disk": {
                "measurement": [
                    "used_percent"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "statsd": {
                "metrics_aggregation_interval": 60,
                "metrics_collection_interval": 30,
                "service_address": ":8125"
            }
        }
    }
}

I tried changing "*" to "/" or "/dev/root" under "resources" and restarted the agent, but it made no difference in the reported value.
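For reference, a new config can be picked up with something like the following (assuming the edited file lives at /opt/aws/amazon-cloudwatch-agent/bin/config.json; adjust the path to wherever yours is stored):

# tell the agent to fetch the on-disk config and restart with it
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json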

Edit: I've since deleted a bunch of files; lsblk now reports 33% disk usage for the "/" mount point, while CloudWatch says 52%.


Solution

  • I figured it out. The culprit is this part of the config:

    "aggregation_dimensions": [
                [
                    "InstanceId"
                ]
            ],
    

    This means that the agent sends an "aggregate" value to CloudWatch, and that aggregate is what I was looking at by accident. I had navigated through the metrics in the CloudWatch console as "CWAgent" - "InstanceId" - "disk_used_percent". That view returns a set of data points for each point in time - one for every path the agent reports on - and lets you pick "average", "max", "min", etc. as the statistic. I had selected "average", so I was seeing the mean across all of the reported paths, not the usage of / alone.
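    You can get an intuition for why that average matches no single mount by averaging the use% column yourself. This is only a rough sketch: with "resources" set to "*" the agent also reports the squashfs loop mounts and tmpfs filesystems, so its average will not line up exactly with plain df output.

        # average the Use% column across every filesystem df reports
        df --output=pcent | tail -n +2 | tr -d ' %' | \
            awk '{ sum += $1; n++ } END { printf "average used%%: %.1f\n", sum / n }'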

    What I should have done was navigate through "CWAgent" - "ImageId, InstanceId, InstanceType, device, fstype, path" - "disk_used_percent" and pick the row with path /. Then I would be looking at only the value for that path, there would be only one sample per time step, and it would match what I see in the terminal.
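    If you prefer the CLI to the console, you can list the dimension sets the agent is publishing and then pull just the per-path series. The instance ID, AMI ID, and time window below are placeholders; every dimension shown by list-metrics has to be supplied for get-metric-statistics to return data.

        # list every dimension combination published for disk_used_percent
        aws cloudwatch list-metrics --namespace CWAgent --metric-name disk_used_percent

        # query one specific combination (placeholder IDs and times)
        aws cloudwatch get-metric-statistics \
            --namespace CWAgent --metric-name disk_used_percent \
            --dimensions Name=path,Value=/ Name=device,Value=nvme0n1p1 Name=fstype,Value=ext4 \
                         Name=InstanceId,Value=i-0123456789abcdef0 \
                         Name=ImageId,Value=ami-0123456789abcdef0 Name=InstanceType,Value=t4g.large \
            --statistics Average --period 300 \
            --start-time 2023-01-01T00:00:00Z --end-time 2023-01-01T01:00:00Z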

    Note: If you really want to dive deep, you can check out the collectd config at /etc/collectd/collectd.conf, which has a configuration block for the df plugin. This should point you to the path where collectd is storing the df information that the CloudWatch agent is reading.
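    For example, something like this will show whether a df-related block is present and which write plugin (and output path) collectd is using; the exact plugin names in your collectd.conf may differ:

        # look for the df plugin and the configured write/output plugins
        grep -n -i -A 3 -E 'df|csv|rrdtool' /etc/collectd/collectd.conf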