java garbage-collection jvm tomcat7 concurrent-mark-sweep

Java CMS GC, GC threads taking CPU when system is idle

We have a web application in tomcat 7, JDK 7, Amazon Linux. This is what we have for the GC configuration:

-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled"

We do not have "PrintGCDetails" enabled.

This is what is printed out in the gc.log file every few seconds (for 48 hours straight when the application at this time is under 0 load - absolutely no activity and even then the following gets printed out and the CPU taken by these threads is at 15%) (please see below for context):

940568.925: [GC 7425641K(8277888K), 0.2201750 secs]
940572.415: [GC 7425641K(8277888K), 0.1782950 secs]
940574.101: [GC 7425641K(8277888K), 0.1736550 secs]
940577.545: [GC 7425647K(8277888K), 0.1779190 secs]

And we see that the following GC threads take up CPU (around 15% overall) even though:

there is no activity in the system,
no other threads are active except these and this continues for 48 hours and the system at this time was under 0 load.
the system was under 0 load also for the previous 24 hours and then these threads became active
when these threads were active and taking up CPU, the preceding lines were printed in the gc.log
this continued for 2 days and when we finally put load by invoking the web application in this JVM for 5 minutes and then it all stopped. We noticed that these threads are no longer taking CPU (they are taking <1 %) now. The problem seems to have resolved itself once a FULL GC happens. The FULL GC happened after the system was put under load.
In short, the GC threads continued taking 15% CPU when the system was under 0 load for 48 hours.

These are the threads:

"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007fad9402e000 nid=0x629d runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007fad94030000 nid=0x629e runnable
"Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x00007fad94031800 nid=0x629f runnable
"Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x00007fad94033800 nid=0x62a0 runnable
"Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x00007fad94035800 nid=0x62a1 runnable
"Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x00007fad94037000 nid=0x62a2 runnable
"Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x00007fad94039000 nid=0x62a3 runnable
"Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x00007fad9403b000 nid=0x62a4 runnable
"Gang worker#8 (Parallel GC Threads)" prio=10 tid=0x00007fad9403d000 nid=0x62a5 runnable
"Gang worker#9 (Parallel GC Threads)" prio=10 tid=0x00007fad9403e800 nid=0x62a6 runnable
"Gang worker#10 (Parallel GC Threads)" prio=10 tid=0x00007fad94040800 nid=0x62a7 runnable
"Gang worker#11 (Parallel GC Threads)" prio=10 tid=0x00007fad94042800 nid=0x62a8 runnable
"Gang worker#12 (Parallel GC Threads)" prio=10 tid=0x00007fad94044000 nid=0x62a9 runnable
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x00007fad9418b000 nid=0x62ae runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x00007fad94183000 nid=0x62aa runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x00007fad94185000 nid=0x62ab runnable
"Gang worker#2 (Parallel CMS Threads)" prio=10 tid=0x00007fad94187000 nid=0x62ac runnable
"Gang worker#3 (Parallel CMS Threads)" prio=10 tid=0x00007fad94188800 nid=0x62ad runnable

Question 1: Why would the gc log be filled with the following when these threads are taking 15% CPU?

940568.925: [GC 7425641K(8277888K), 0.2201750 secs]
940572.415: [GC 7425641K(8277888K), 0.1782950 secs]

Question 2: Does the above log indicate a failure in CMS (mark, sweep etc.)?

Question 3: Why did it resolve itself when we introduced moderate load on the web application and a FULL GC happens?

Yes, I will try to replicate the issue in another environment with printGCDetails enabled and return with information or a resolution - in case anybody has seen this earlier, please add to the conversation.

Edit 1: This is the log when "PrintGCDetails" is enabled:

1376.326: [GC [1 CMS-initial-mark: 6481517K(7281088K)] 6536750K(8277888K), 0.0094620 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
1376.336: [CMS-concurrent-mark-start]
1377.992: [CMS-concurrent-mark: 1.656/1.656 secs] [Times: user=6.62 sys=0.00, real=1.66 secs] 
1377.992: [CMS-concurrent-preclean-start]
1378.005: [CMS-concurrent-preclean: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 

1378.006: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 1413.110: [CMS-concurrent-abortable-preclean: 4.169/35.104 secs] [Times: user=4.23 sys=0.01, real=35.11 secs] 
1413.111: [GC[YG occupancy: 55233 K (996800 K)]1413.111: [Rescan (parallel) , 0.0154040 secs]1413.126: [weak refs processing, 0.0001380 secs]1413.126: [class unloading, 0.0013450 secs]1413.128: [scrub symbol table, 0.0023140 secs]1413.130: [scrub string table, 0.0004850 secs] [1 CMS-remark: 6481517K(7281088K)] 6536751K(8277888K), 0.0201870 secs] [Times: user=0.20 sys=0.00, real=0.02 secs] 
1413.131: [CMS-concurrent-sweep-start]
1414.151: [CMS-concurrent-sweep: 1.020/1.020 secs] [Times: user=1.02 sys=0.00, real=1.02 secs] 
1414.151: [CMS-concurrent-reset-start]
1414.177: [CMS-concurrent-reset: 0.026/0.026 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]

Also note: the "CMSMaxAbortablePrecleanTime" is set to 35 secs. Thanks.

Solution

There has to be an adequate buffer between the occupied heap and what is available. Enhancing the size of the (max) heap so that the occupied is 70% of the total resolved the issue. This is of particular importance when the algorithm is CMS. Lesson Learnt.