Tags: java, garbage-collection, concurrent-mark-sweep

High frequency of CMS tenured phases even when Old/Young gen have ample space


I acknowledge up front that this issue is very similar to: high-number-of-cms-mark/remark-pauses-even-though-old-gen-is-not-half-full and tenured-collection-starting-for-no-apparent-reason. I'm posting because (1) those threads are over a year old, and (2) I'm hoping to learn how to find the root cause for the onset of this behavior.

We have an OAS/OC4J (it's not our fault!) 24/7 Java application server running on RHEL5/Redhat 5.11 with Java 6. It has been stable memory-wise for many years, until recently we started seeing high CPU utilization due to frequent CMS tenured-space cycles. This happens even when there is more than ample space in both the young and tenured spaces. My reading on this topic suggests that a CMS tenured cycle typically commences when the tenured (old gen) space is roughly 92% of capacity, but we see this repeatedly happening at 30% capacity or even less. I should also mention that it happens when total heap usage appears to be below the default 45% value for overall heap occupancy, aka InitiatingHeapOccupancyPercent.
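
As a side note, here is a quick sketch of how we might confirm the configured CMS trigger flags and their defaults (it assumes the JDK build is new enough, 6u21+, to support -XX:+PrintFlagsFinal; with adaptive ergonomics CMS may still start cycles earlier based on runtime statistics):

    # Print the configured CMS trigger flags; run with the same GC options as production
    java -XX:+UseConcMarkSweepGC -XX:+PrintFlagsFinal -version | egrep 'CMSInitiating|CMSTrigger'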

We are still reviewing recent code changes and have tried a few things, but the issue persists. So far we're unable to reproduce it outside of the production servers, although efforts in a dev/qa environment are ongoing.

I think I have three main questions here:

  1. What might be triggering the frequent (premature?) initial-mark phase of the CMS cycle, and how might we verify or investigate this? E.g., inspect various segments of the current memory allocations (eden, survivor, old gen) for huge objects or the like?
  2. I've read about using -XX:+UseCMSInitiatingOccupancyOnly and -XX:CMSInitiatingOccupancyFraction=NN (for example in the articles referenced above). What might be a reasonable (== safe) value for NN, and what are the risks of overriding the default CMS ergonomics this way? (A sketch of such a flag set appears just after this list.)
  3. Any other things we should consider or investigate?
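
For reference, here is an illustrative sketch of that override (the 75% figure is just a commonly cited starting point, not a value validated for our application):

    # Illustrative only: have CMS start old-gen cycles at an explicit occupancy
    # threshold instead of relying on its adaptive ergonomics
    -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75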

Here are some details about our issue:

  • So far we're unable to reproduce this outside of production, so interactive debugging or tuning isn't an option
  • We use a nightly cron job to force a Full GC via jmap -histo:live <pid> to mitigate fragmentation (a sketch of the entry appears after the JVM arguments below)
  • Our JVM command-line args related to memory are below:

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
    -XX:-TraceClassUnloading -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
    -XX:+HeapDumpOnOutOfMemoryError -XX:+ExplicitGCInvokesConcurrent
    -XX:+UseCMSCompactAtFullCollection -Xms10g -Xmx10g -Xmn3g -XX:SurvivorRatio=6
    -XX:PermSize=256m -XX:MaxPermSize=256m -XX:TargetSurvivorRatio=80
    -XX:ParallelGCThreads=8

Note: we recently tried bumping the young generation to 3.5g as a somewhat desperate experiment (in production!). No real discernible difference was observed.
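
For completeness, the nightly full-GC cron job mentioned above is essentially along these lines (the schedule, JDK path, and pid lookup shown here are placeholders, not our actual entry); jmap -histo:live forces a full GC before counting live objects:

    # Placeholder crontab entry: force a full GC nightly at 02:30 by requesting
    # a live-object histogram (the :live option triggers a full GC first)
    30 2 * * * /usr/java/jdk1.6/bin/jmap -histo:live $(cat /var/run/oc4j.pid) > /dev/null 2>&1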

  • Output of jmap -heap. Note: the From space always seems to be 100% occupied. Is this normal, or indicative of anything?


    using parallel threads in the new generation.
    using thread-local object allocation.
    Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 10737418240 (10240.0MB)
   NewSize          = 3758096384 (3584.0MB)
   MaxNewSize       = 3758096384 (3584.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 6
   PermSize         = 268435456 (256.0MB)
   MaxPermSize      = 268435456 (256.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 3288334336 (3136.0MB)
   used     = 1048575408 (999.9994354248047MB)
   free     = 2239758928 (2136.0005645751953MB)
   31.88773709900525% used
Eden Space:
   capacity = 2818572288 (2688.0MB)
   used     = 578813360 (551.9994354248047MB)
   free     = 2239758928 (2136.0005645751953MB)
   20.535693282172794% used
From Space:
   capacity = 469762048 (448.0MB)
   used     = 469762048 (448.0MB)
   free     = 0 (0.0MB)
   100.0% used
To Space:
   capacity = 469762048 (448.0MB)
   used     = 0 (0.0MB)
   free     = 469762048 (448.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 6979321856 (6656.0MB)
   used     = 1592989856 (1519.1935119628906MB)
   free     = 5386332000 (5136.806488037109MB)
   22.82442175425016% used
Perm Generation:
   capacity = 268435456 (256.0MB)
   used     = 249858712 (238.2838363647461MB)
   free     = 18576744 (17.716163635253906MB)
   93.07962357997894% used

  • Output from an in-house GC log parser, showing frequent initial mark (IM) / remark (RM) cycles and low young/tenured occupancy. You can see young gen occupancy slowly grow up to 98.30%, soon after which we get the expected ParNew (PN) young GC:


    Note: the meaning of the three columns after the event column depends on the
    event type: PN rows show |YHeapB4|YHeapAf|YHeapDt|, PF/CF/SY rows show
    |OHeapB4|OHeapAf|OHeapDt|, and IM/RM rows show |Y Occ% |OHeap  |O Occ% |.

    Date       |Time        |Event |Y Occ% |OHeap  |O Occ% |Duration|THeapB4|THeapAf|THeapDt|Promoted|% Garbage|Interval| Alloc |AllocRate(MB/s)|PromoRate(MB/s)
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
    2016-12-05|14:16:59.455|  RM  |  15.11|1620287|  23.77|     0.18
    2016-12-05|14:17:03.057|  IM  |  16.16|1615358|  23.70|     0.66
    2016-12-05|14:17:13.444|  RM  |  17.70|1615358|  23.70|     0.23
    2016-12-05|14:17:17.227|  IM  |  18.82|1513691|  22.21|     0.70
    2016-12-05|14:17:27.887|  RM  |  28.54|1513691|  22.21|     0.33
    2016-12-05|14:17:30.390|  IM  |  29.45|1513667|  22.21|     1.02
    2016-12-05|14:17:41.326|  RM  |  32.90|1513667|  22.21|     0.66
    2016-12-05|14:17:44.290|  IM  |  34.86|1513666|  22.21|     1.23
    ...[deleted for brevity]...
    2016-12-05|14:37:28.024|  IM  |  95.88|1377444|  20.21|     2.93
    2016-12-05|14:37:40.601|  RM  |  95.89|1377444|  20.21|     2.15
    2016-12-05|14:37:46.032|  IM  |  95.89|1377443|  20.21|     2.83
    2016-12-05|14:37:58.557|  RM  |  98.30|1377443|  20.21|     2.21
    2016-12-05|14:38:03.988|  IM  |  98.30|1377307|  20.21|     2.90
    2016-12-05|14:38:15.638|  PN  |3211264| 458752|2752512|     0.77|4588571|1942900|2645671|  106841|    96.12
    2016-12-05|14:38:18.193|  RM  |  18.04|1484148|  21.78|     0.24
    2016-12-05|14:38:21.813|  IM  |  18.04|1480802|  21.73|     0.75
    2016-12-05|14:38:31.822|  RM  |  19.05|1480802|  21.73|     0.34
    ...[and so on]...

  • Actual GC log output, starting with the first initial mark (IM) at 14:17:03.057 from the output above. It is truncated similarly to the above, but I do show the ParNew young GC for completeness:


    2016-12-05T14:17:03.057-0800: [GC [1 CMS-initial-mark: 1615358K(6815744K)] 2134211K(10027008K), 0.6538170 secs] [Times: user=0.65 sys=0.00, real=0.66 secs]
    2016-12-05T14:17:06.178-0800: [CMS-concurrent-mark: 2.463/2.467 secs] [Times: user=5.04 sys=0.01, real=2.46 secs]
    2016-12-05T14:17:06.251-0800: [CMS-concurrent-preclean: 0.072/0.073 secs] [Times: user=0.07 sys=0.00, real=0.08 secs]
     CMS: abort preclean due to time 2016-12-05T14:17:13.442-0800: [CMS-concurrent-abortable-preclean: 7.189/7.192 secs] [Times: user=7.46 sys=0.02, real=7.19 secs] 
    2016-12-05T14:17:13.444-0800: [GC[YG occupancy: 568459 K (3211264 K)][Rescan (parallel) , 0.1020240 secs][weak refs processing, 0.0312140 secs][class unloading, 0.0396040 secs][scrub symbol & string tables, 0.0368990 secs] [1 CMS-remark: 1615358K(6815744K)] 2183818K(10027008K), 0.2344980 secs] [Times: user=0.89 sys=0.00, real=0.23 secs]
    2016-12-05T14:17:15.212-0800: [CMS-concurrent-sweep: 1.533/1.533 secs] [Times: user=1.54 sys=0.00, real=1.54 secs]
    2016-12-05T14:17:15.225-0800: [CMS-concurrent-reset: 0.013/0.013 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
    2016-12-05T14:17:17.227-0800: [GC [1 CMS-initial-mark: 1513691K(6815744K)] 2118034K(10027008K), 0.7036950 secs] [Times: user=0.71 sys=0.00, real=0.70 secs]
    2016-12-05T14:17:20.548-0800: [CMS-concurrent-mark: 2.613/2.617 secs] [Times: user=5.62 sys=0.03, real=2.62 secs]
    2016-12-05T14:17:20.667-0800: [CMS-concurrent-preclean: 0.113/0.119 secs] [Times: user=0.23 sys=0.00, real=0.12 secs]
     CMS: abort preclean due to time 2016-12-05T14:17:27.886-0800: [CMS-concurrent-abortable-preclean: 7.217/7.219 secs] [Times: user=8.54 sys=0.07, real=7.22 secs] 
    2016-12-05T14:17:27.887-0800: [GC[YG occupancy: 916526 K (3211264 K)][Rescan (parallel) , 0.2159770 secs][weak refs processing, 0.0000180 secs][class unloading, 0.0460640 secs][scrub symbol & string tables, 0.0404060 secs] [1 CMS-remark: 1513691K(6815744K)] 2430218K(10027008K), 0.3276590 secs] [Times: user=1.59 sys=0.02, real=0.33 secs]
    2016-12-05T14:17:29.611-0800: [CMS-concurrent-sweep: 1.396/1.396 secs] [Times: user=1.40 sys=0.00, real=1.39 secs]
    ...[And So On]...
    2016-12-05T14:38:03.988-0800: [GC [1 CMS-initial-mark: 1377307K(6815744K)] 4534072K(10027008K), 2.9013180 secs] [Times: user=2.90 sys=0.00, real=2.90 secs]
    2016-12-05T14:38:09.403-0800: [CMS-concurrent-mark: 2.507/2.514 secs] [Times: user=5.03 sys=0.03, real=2.51 secs]
    2016-12-05T14:38:09.462-0800: [CMS-concurrent-preclean: 0.058/0.058 secs] [Times: user=0.06 sys=0.00, real=0.06 secs]
    2016-12-05T14:38:15.638-0800: [GC [ParNew
    Desired survivor size 375809632 bytes, new threshold 4 (max 15)
    - age   1:  115976192 bytes,  115976192 total
    - age   2:  104282224 bytes,  220258416 total
    - age   3:   85871464 bytes,  306129880 total
    - age   4:   98122648 bytes,  404252528 total
    : 3211264K->458752K(3211264K), 0.7731320 secs] 4588571K->1942900K(10027008K), 0.7732860 secs] [Times: user=3.15 sys=0.00, real=0.77 secs]
     CMS: abort preclean due to time 2016-12-05T14:38:18.192-0800: [CMS-concurrent-abortable-preclean: 7.842/8.730 secs] [Times: user=12.50 sys=0.07, real=8.73 secs]
    2016-12-05T14:38:18.193-0800: [GC[YG occupancy: 579220 K (3211264 K)][Rescan (parallel) , 0.1208810 secs][weak refs processing, 0.0008320 secs][class unloading, 0.0483220 secs][scrub symbol & string tables, 0.0414970 secs] [1 CMS-remark: 1484148K(6815744K)] 2063368K(10027008K), 0.2376050 secs] [Times: user=1.07 sys=0.00, real=0.24 secs]
    2016-12-05T14:38:19.798-0800: [CMS-concurrent-sweep: 1.366/1.366 secs] [Times: user=1.40 sys=0.00, real=1.37 secs]
    2016-12-05T14:38:19.811-0800: [CMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
    2016-12-05T14:38:21.813-0800: [GC [1 CMS-initial-mark: 1480802K(6815744K)] 2060239K(10027008K), 0.7487000 secs] [Times: user=0.75 sys=0.00, real=0.75 secs]

Going on Alexey's great observation and suggestions, we're going to try bumping the Perm Generation size in production (and I'll report back). As preliminary validation of his guess, I surveyed perm gen usage for all of the container JVMs on one of our hosts, and it seems very plausible. In the snippet below, PID=2979 (at 92% perm gen capacity) is the one exhibiting the constant CMS collection behavior.



    [oracle@ikm-oasb-3 bin]$ for p in `opmnctl status | grep OC4JG | awk '{print $5}'`; do echo -n "PID=$p "; jmap -heap $p | grep -A4 'Perm Gen' | egrep '%'; done 2> /dev/null
    PID=8456    89.31778371334076% used
    PID=8455    89.03931379318237% used
    PID=8454    91.1630779504776% used
    PID=8453    89.17466700077057% used
    PID=8452    87.69496977329254% used
    PID=2979    92.2750473022461% used
    PID=1884    90.25585949420929% used
    PID=785    76.16643011569977% used
    PID=607    89.06879723072052% used
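
To watch the suspect JVM's perm occupancy over time, here is a rough sketch using jstat (assuming the jstat on the PATH belongs to the same JDK 6 as the target process; the P column is perm gen utilization as a percentage):

    # Sample GC utilization of PID 2979 every 10 seconds; watch the P (perm) column
    jstat -gcutil 2979 10000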


Solution

  • A CMS old-space cleanup cycle is triggered when either the old space or the permanent space reaches its occupancy threshold.

    Prior to Java 8, the permanent space is part of the garbage-collected heap and is in scope for the CMS algorithm.

    In your case, perm is at 93%:

    Perm Generation:
       capacity = 268435456 (256.0MB)
       used     = 249858712 (238.2838363647461MB)
       free     = 18576744 (17.716163635253906MB)
       93.07962357997894% used
    

    This article describes a similar case.

    I would advise you to increase the perm space, or you can configure a separate occupancy threshold for the perm space using the -XX:CMSInitiatingPermOccupancyFraction=95 option (see the sketch below).

    More GC tuning options.
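
    For illustration, the adjustments above might look like this on the command line (the sizes and the percentage are placeholders to tune, not recommendations for this particular application):

        # Placeholder values: give perm gen more headroom and set an explicit
        # perm-occupancy threshold for triggering a CMS cycle
        -XX:PermSize=512m -XX:MaxPermSize=512m -XX:CMSInitiatingPermOccupancyFraction=95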