Search code examples
mysqlscalaplayframeworkamazon-auroramariadb-connector-c

MariaDb-failover is running high on CPU and chatty even when there is no traffic


We have an API written in scala 2.12.x and Play framework 2.5.x. The API connects to AWS aurora cluster using MariaDb connector/J 2.5.4 as such jdbc:mysql:aurora://some-aurora-cluster

Functionally everything works fine except we've noticed high CPU usage even when there is no traffic. A bit of research shows the following:

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ top -H

    …
 6373 root      20   0 4644452 990888  21540 S 14.6 12.6   1:15.68 MariaDb-failove
 6374 root      20   0 4644452 990888  21540 S 13.6 12.6   1:16.11 MariaDb-failove
 6305 root      20   0 4644452 990888  21540 S 13.3 12.6   1:14.31 MariaDb-failove
 6375 root      20   0 4644452 990888  21540 S 12.3 12.6   1:14.59 MariaDb-failove
 6372 root      20   0 4644452 990888  21540 S 11.3 12.6   1:15.78 MariaDb-failove
    …

The above cmd shows a number of MariaDb-failover. I am not sure what it does and why there are multiple of it busy with high CPU usage.

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ netstat -a | less

…
tcp6       0      0 ip-xxx-xxx-xxx-31.:37446 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37108 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37648 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:36934 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:36870 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37254 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37902 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
…

There are a lot of TIME_WAIT. Again this is odd since there was no traffic at the time I executed this cmd.

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n

      1 established)
      1 Foreign
      1 SYN_SENT
      6 LISTEN
     37 ESTABLISHED
    851 TIME_WAIT

There are hundred of TIME_WAIT; and the number is changing every time I executed the cmd.

Does anyone have any insights as to whether this is normal or something I need to worry about?

Please let me know if you have further questions.

========== More info

ps -aux | grep java

Got the PID: 2655

jstack 2655 > threaddump.log

Here is the content (pruned):

2020-06-16 16:44:39
Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode):

"MariaDb-failover-5" #276 daemon prio=5 os_prio=0 tid=0x00007f4b0400b000 nid=0x18e7 waiting on condition [0x00007f4b0c6d6000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
    at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
    at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"MariaDb-failover-4" #275 daemon prio=5 os_prio=0 tid=0x00007f4b0400a000 nid=0x18e6 waiting on condition [0x00007f4afbefd000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
    at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
    at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"MariaDb-failover-3" #274 daemon prio=5 os_prio=0 tid=0x00007f4b04009000 nid=0x18e5 waiting on condition [0x00007f4afbbfa000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
    at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
    at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"MariaDb-failover-2" #273 daemon prio=5 os_prio=0 tid=0x00007f4b04008000 nid=0x18e4 waiting on condition [0x00007f4b0c5d5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
    at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
    at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"MariaDb-failover-1" #265 daemon prio=5 os_prio=0 tid=0x00007f4b2407e800 nid=0x18a1 waiting on condition [0x00007f4b0c2d4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
    at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
    at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"VM Thread" os_prio=0 tid=0x00007f4b540db800 nid=0xf37 runnable 

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f4b54069000 nid=0xf35 runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f4b5406a800 nid=0xf36 runnable 

"VM Periodic Task Thread" os_prio=0 tid=0x00007f4b54132800 nid=0xf3e waiting on condition 

JNI global references: 2520

PIDs of LWPs in Hex:

6373 18e5
6374 18e6
6305 18a1
6375 18e7
6372 18e4

========== More info

Our API has 5 different db configurations - 5 different databases. Each has a connection string as such jdbc:mysql:aurora://some-aurora-cluster

Notice that aurora mode is used for better experience of fail-over so that it explains the 5 light-weight processes. But they are chatty, resulting in many TIME_WAIT and possibly the cause of elevated CPU usage.

Has anyone run into this situation before and how did you mitigate it? I still would like to use aurora mode (or something equivalent) so that we don't have to restart the app for when the db fails over.


Solution

  • After several days of searching and researching, I finally delved into the MariaDb driver code, particularly in the area where the thread dump had shown, and kept following the code stack.

    https://github.com/mariadb-corporation/mariadb-connector-j/blob/master/src/main/java/org/mariadb/jdbc/internal/protocol/AuroraProtocol.java

    I've found a reference to a setting retriesAllDown which has a default value of 120. Further reading on MariaDb driver knowledge base page, I've also discovered another setting failoverLoopRetries which has the same default value of 120.

    Here is where you can read more about MariaDb driver settings: https://github.com/mariadb-corporation/mariadb-connector-j/blob/3bc66153b51aca188afc50ff35a0123f16c099ed/src/main/java/org/mariadb/jdbc/util/DefaultOptions.java

    For our team and API, we feel comfortable with a value of 12 (10% of the default) and decide to go with that value for both settings, so here is the modified connection string:

    jdbc:mysql:aurora://some-aurora-cluster?retriesAllDown=12&failoverLoopRetries=12
    

    That eases the CPU usage significantly and still maintains the fail-over capability that we need.

    Hopefully this answer helps someone down the road. I won't mark it as the answer for my original question until it helps at least 10 other people.