I have a t4g.large EC2 instance running Ubuntu 22.04 with a single 30 GB storage volume. I have installed and configured the CloudWatch agent to monitor disk usage.
Right now, the metrics in CloudWatch show that the disk is 56% full.
If I run lsblk -f, I see this (I deleted the UUID column for conciseness):
NAME FSTYPE FSVER LABEL FSAVAIL FSUSE% MOUNTPOINTS
loop0 squashfs 4.0 0 100% /snap/core20/1699
loop1 squashfs 4.0 0 100% /snap/amazon-ssm-agent/5657
loop2 squashfs 4.0
loop3 squashfs 4.0 0 100% /snap/lxd/23545
loop4 squashfs 4.0 0 100% /snap/core18/2658
loop5 squashfs 4.0 0 100% /snap/core18/2636
loop6 squashfs 4.0 0 100% /snap/snapd/17885
loop7 squashfs 4.0 0 100% /snap/amazon-ssm-agent/6313
loop8 squashfs 4.0 0 100% /snap/core20/1740
nvme0n1
├─nvme0n1p1 ext4 1.0 cloudimg-rootfs 2.9G 90% /
└─nvme0n1p15 vfat FAT32 UEFI 92.4M 5% /boot/efi
If I run df -h, I see this:
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 27G 2.9G 91% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 1.1M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p15 98M 5.1M 93M 6% /boot/efi
tmpfs 782M 8.0K 782M 1% /run/user/1000
I don't understand where 56% could be coming from. Even if the CloudWatch agent were summing used and total space across all of the mount points, that would come out to ~75%, not 56%.
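(For reference, this is the back-of-the-envelope arithmetic behind that ~75%: pooling the Used and Size columns from the df output above. It's only a sanity check, not something I expect the agent to actually compute.)

# 27G used out of roughly 29 + 3.9 + 1.6 + 0.005 + 0.098 + 0.782 GiB total
awk 'BEGIN { printf "%.1f\n", 100 * 27 / (29 + 3.9 + 1.6 + 0.005 + 0.098 + 0.782) }'
# prints 76.3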
This is my config for the agent:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "aggregation_dimensions": [
      [
        "InstanceId"
      ]
    ],
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "collectd": {
        "metrics_aggregation_interval": 60
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "statsd": {
        "metrics_aggregation_interval": 60,
        "metrics_collection_interval": 30,
        "service_address": ":8125"
      }
    }
  }
}
I tried changing "*" to "/" or "/dev/root" in resources and restarted the agent, but it made no difference to the reported value.
Edit: I've now deleted a bunch of files, and lsblk reports 33% disk usage at the "/" mount point, while CloudWatch says 52%.
I figured it out. The culprit is this part of the config:
"aggregation_dimensions": [
[
"InstanceId"
]
],
This means that the agent also sends an "aggregate" version of each metric to CloudWatch, and that is what I was looking at by accident. To get to this aggregate, I had navigated through the Metrics in the CloudWatch console as "CWAgent" - "InstanceId" - "disk_used_percent". That metric has a set of data points for each point in time: one result for every path the agent reports on. From there you can select "Average", "Max", "Min", etc. to use this data, and I had selected "Average".
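(You can see both flavours of the metric from the CLI as well: list-metrics shows the InstanceId-only rollup created by aggregation_dimensions alongside the fully dimensioned per-path series. The instance ID below is a placeholder.)

# List every dimension combination published for disk_used_percent on this instance
aws cloudwatch list-metrics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0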
What I should have done was navigate through "CWAgent" - "ImageId, InstanceId, InstanceType, device, fstype, path" - "disk_used_percent" and pick the row with path /. Then I would be looking at only the value for that path, there would be only one sample per time step, and it would match what I see in the terminal.
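For example, a get-metric-statistics call along these lines should return only that series (the instance and image IDs are placeholders, and every dimension has to match exactly what the agent published, so take the device and fstype values from the console if yours differ):

# Fetch the per-path disk_used_percent for / over the last hour
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
               Name=ImageId,Value=ami-0123456789abcdef0 \
               Name=InstanceType,Value=t4g.large \
               Name=device,Value=nvme0n1p1 \
               Name=fstype,Value=ext4 \
               Name=path,Value=/ \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"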
Note: If you really want to dive deep, you can check out the collectd config at /etc/collectd/collectd.conf, which has a section for the df plugin. This should point you to the path where collectd is storing the df information that the CloudWatch agent is reading.
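A quick way to find that stanza, assuming the stock Ubuntu collectd.conf layout:

# Show the df plugin section (matches both "LoadPlugin df" and "<Plugin df>")
grep -n -A 10 'Plugin df' /etc/collectd/collectd.conf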