Search code examples
mysqlamazon-web-servicesdatabase-performanceamazon-aurora

Aurora database exceeding capacity due to "/io/file/myisam/kfile" Wait


This morning, one of our Aurora clusters suddenly began experiencing high latency, slower running queries and was being reported as exceeding exceeding capacity - with up to 20 sessions for the db.r5.large instance which has only 2 CPUs.

There we no code changes, no deploy, no background process or any other cause we can identify. The higher latency is intermittent, occurring every 10 minutes and lasting for about as long. The Aurora monitoring isn't helping much, the only change of note being higher latency on the all queries (selects, updates and deletes).

enter image description here

Under Performance Metrics, the cases where usage spikes - we're seeing that of the total 20 sessions, these are attributed almost solely to the io/file/myisam/kfile Wait. Researching online has yielded very little and so I'm somewhat stumped as to what this means, and how to go about getting to the cause of the issue. Looking at the SQL queries ran during spikes, their slow run time appears more caused by the intermittent issue - as opposed to the being the cause of it.

So my question is: can anyone explain what the 'myisam/kfile' Wait is, and how I can use this knowledge to help diagnose the cause of the problem here?

My feeling is that it's one of those rare occurrences where an AWS instance unexplainably goes rogue at a level below which we can directly control and is only solved by spinning up a new instance (even where all else is equal from a configuration and code perspective). All the same, I'd love to better understand the issue here, especially when none of our DB table are MyISAM, all being innoDB.


Solution

    • Is there a table called kfile? How big is it? What operations are being performed?
    • While the problem is occurring, do SHOW FULL PROCESSLIST; to see what is running. That may give a good clue.
    • If the slowlog is turned on, look at it shortly after the problem has subsided; probably the naughty query will be at the end of the list. Publish pt-query-digest path_to_slowlog. The first one or two queries are very likely to be the villains.
    • Check SHOW ENGINE INNODB STATUS;. Near the front will be the "latest deadlock". That may be a clue.
    • In most situations most of the graphs don't provide any useful information. When something does go wrong, it is not obvious which graph to look at. I would look for graphs that look "different" in sync with the problem. The one you show us perhaps indicates that there is a 20-second timeout, and everyone is stuck until they hit that timeout.
    • Do you run any ALTERs? When do backups occur?
    • Are you using MyISAM? Don't. That ENGINE does not allow concurrency, hence could lead to a bunch of queries piling up. More on converting to InnoDB: http://mysql.rjweb.org/doc.php/myisam2innodb