Is it normal for EC2 instance store disks to perform worse and worse over a few days? Do they have IOPS quotas?
I launched a c5d.large instance a few days ago. Its job is to download a ZIP containing 500 x 1 MB text files (unzipped size), unzip them to disk, and re-zip them into 500 individual ZIPs.
TL;DR: it reads and writes files on the instance storage.
Synthetic tests already looked fishy:
dd if=/dev/zero of=/mnt/testfile bs=1G count=1
Day 1: 500 MB/s
Day 2: 120 MB/s
Days 3-4: 40 MB/s
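Note: dd reading from /dev/zero with bs=1G mostly measures the page cache, so the numbers above are probably optimistic. A variant along these lines, which bypasses the cache (or forces a flush before reporting), should be closer to what the disk itself can do; the test file path is just the same instance store mount as above:
dd if=/dev/zero of=/mnt/testfile bs=1M count=1024 oflag=direct      # write around the page cache
dd if=/dev/zero of=/mnt/testfile bs=1M count=1024 conv=fdatasync    # or flush to disk before dd reports the rate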
This wasn't really an issue at first, because while working the instance had ~0 iowait and was maxing out the CPU anyway.
The problem is that after two days of running, the slowdown got bad enough to affect the actual workload.
The curious thing is that dd and hdparm still look more or less okay:
# sudo hdparm -Tt /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 15170 MB in 1.99 seconds = 7612.25 MB/sec
Timing buffered disk reads: 392 MB in 3.01 seconds = 130.24 MB/sec
# dd if=/dev/zero of=/mnt/testfile bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 29.9148 s, 35.9 MB/s
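Side note: dd and hdparm are purely sequential tests, while the zip/unzip job is lots of smaller reads and writes, so a random-I/O benchmark is probably a closer match for the real workload. A rough sketch with fio (assuming fio is installed; the file name and sizes are just placeholders, not part of my actual job):
fio --name=randrw --filename=/mnt/fio_test --size=1G \
    --rw=randrw --bs=16k --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting      # 60s of mixed 16k random reads/writes, bypassing the page cache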
Later edit: this is also happening on an EBS volume, with the difference that it doesn't recover after rebooting:
With an EBS volume, everything works great for a few hours after launching or modifying the volume, then performance drops dramatically for the same workload. Eventually it struggles to sustain even 50% of the initial throughput, to the point where I can barely log in and su to root.
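Note: the "great for a few hours, then drops" pattern on EBS looks a lot like gp2 burst credits running out. Assuming the volume is gp2, the BurstBalance CloudWatch metric should confirm or rule that out; the volume ID and time range below are placeholders:
aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2019-06-01T00:00:00Z --end-time 2019-06-02T00:00:00Z \
    --period 300 --statistics Average                # 5-minute average of remaining burst credits, as a percentage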
It looks like every EC2 instance type is capped at a certain bandwidth/IOPS when talking to EBS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
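A quick way to check whether the job is actually pushing up against that per-instance cap is to watch the device while it runs (iostat comes with the sysstat package) and compare the reported throughput and IOPS against the table in the link above:
iostat -xm /dev/nvme1n1 5    # extended per-device stats in MB/s, refreshed every 5 seconds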