Search code examples
amazon-web-servicesquery-optimizationdatabase-performanceamazon-aurora

mysql - Wait event "io/table/sql/handler" - cpu spike


I am seeing lot of this wait event in my AWS Aurora MySQL (8.0.mysql_aurora.3.04.0) and the documentation says "this event indicate an increase in workload activity. Increased activity means increased I/O", which causes spike in CPU usage.

enter image description here

Note, the underlying query is ONLY a select statement running against the reader. we are an api based weather application and the parameters under "IN" clause can range from 1000 to 5000 [max_allowed_packets = 1GB] . Query scans the corresponding table (250GB) and uses proper index to retrieve the records and complete within seconds, sometime milliseconds. I am new to mysql and looking for some help on how to avoid/reduce the cpu spike by tweaking any configuration parameters.

PS: I am seeing spike in the following cloudwatch metrics on the DB Cluster

AuroraSlowConnectionHandleCount
AuroraReplicaLag
DBLoadNonCPU
ReadIOPS
ReadLatency
ReadThroughput
SelectLatency

edit: Query and plan -- This table is not partitioned and returns around 200 rows( < 1% of total table rows). No DML activity on this db since its a reader instance.

CREATE TABLE `TABLE_NAME` (
  `DATETIMES` bigint unsigned NOT NULL,
  `S_ID` int unsigned NOT NULL,
  `V_ID` smallint unsigned NOT NULL,
  `NOS` int unsigned DEFAULT NULL,
  `DELTA1` float DEFAULT NULL,
  `DELTA2` int unsigned DEFAULT NULL,
  `FLAGS` tinyint unsigned DEFAULT NULL,
  `YEARS` float DEFAULT NULL,
  PRIMARY KEY (`S_ID`,`DATETIMES`,`V_ID`),
  KEY `DATETIMES` (`DATETIMES`,`S_ID`,`V_ID`,`NOS`,`DELTA1`,`DELTA2`,`FLAGS`,`YEARS`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

 SELECT a.S_ID, a.V_ID, a.DATETIMES, a.YEARS, a.NOS
        FROM DB_NAME.TABLE_NAME a
        INNER JOIN (
            SELECT S_ID, V_ID, max(DATETIMES) DATETIMES
            FROM DB_NAME.TABLE_NAME  force index(DATETIMES) 
            WHERE DATETIMES >= 202310161117 and DATETIMES <= 202310161317 and S_ID in (1,2,5,9,....,4890)
            GROUP BY S_ID, V_ID
        ) b ON a.S_ID = b.S_ID AND a.V_ID = b.V_ID and a.DATETIMES = b.DATETIMES
        
-> Limit: 200 row(s)  (cost=687820.47 rows=200) (actual time=621.565..622.059 rows=71 loops=1)
    -> Nested loop inner join  (cost=687820.47 rows=580086) (actual time=621.564..622.053 rows=71 loops=1)
        -> Filter: (b.DATETIMES is not null)  (cost=0.11..65262.18 rows=580086) (actual time=621.543..621.565 rows=71 loops=1)
            -> Table scan on b  (cost=2.50..2.50 rows=0) (actual time=0.001..0.010 rows=71 loops=1)
                -> Materialize  (cost=2.50..2.50 rows=0) (actual time=621.542..621.556 rows=71 loops=1)
                    -> Table scan on <temporary>  (actual time=0.003..0.014 rows=71 loops=1)
                        -> Aggregate using temporary table  (actual time=621.498..621.515 rows=71 loops=1)
                            -> Filter: ((TABLE_NAME.DATETIMES >= 202310161117) and (TABLE_NAME.DATETIMES <= 202310161317) and (TABLE_NAME.S_ID in (1,2,5,9,....,4890)))  (cost=238947.80 rows=580086) (actual time=0.377..620.724 rows=713 loops=1)
                                -> Covering index range scan on TABLE_NAME using DATETIMES over (202310161117 <= DATETIMES <= 202310161317 AND 1 <= S_ID <= 4890)  (cost=238947.80 rows=1160172) (actual time=0.021..520.516 rows=738696 loops=1)
        -> Single-row index lookup on a using PRIMARY (S_ID=b.S_ID, DATETIMES=b.DATETIMES, V_ID=b.V_ID)  (cost=0.97 rows=1) (actual time=0.007..0.007 rows=1 loops=71)

Appreciate your time!!

Generate query plan for the underlying query and confirmed its not doing table scan.


Solution

  • Your large list S_ID IN (large, list) induces the query planner to scan every row in the time range in the index you forced, rather than somehow selecting just the rows you need. That takes CPU and IO. It's real work and you see those waits because sometimes the database server takes time to finish it.

    If this were my project I'd trust the query planner and stop forcing the choice of index. It's possible the PK's index (the clustered index) will be the right choice for some or all those long-list queries. And, those queries do take time. Those waits are due to that time. No weird index voodoo is going to change that.

    If you do bulk table loads, do ANALYZE TABLE TABLE_NAME to update the optimizer statistics afterward. That may, or may not, help the query planner choose the best way to satisfy the query. (And do it once if you've never done it.)

    It might be possible to refactor your table design, especially if you have a limited set of those long lists your application uses. But only you know how that part of your data model