Search code examples
javagarbage-collectionloadg1gc

Java G1GC sudden System usage spike 100%


We are using java 1.8.144_b_01. We have a problem where G1GC sys usage spike suddenly eg: from 1 to 113 , 140 which is very high than user space cpu usage.

At that time Object Copy time becomes very high ~10 times the normal , i see the following logs very often:

2018-03-09T14:55:30.111+0530: 45966.639: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 2.1706858 secs]

After this the load avg. start to spike up > 100 and when i see in top i see all the cpu is taken by sys space and mostly it shows 100% there.

All this happens suddenly ~ some fixed time daily on AWS. Sometimes application run without issue for 10 days, sometime daily around same time we see the issue . The usage pattern remain exactly the same every day.

What should i do next?

   [Eden: 2656.0M(13.5G)->0.0B(13.7G) Survivors: 440.0M->592.0M Heap: 14.2G(23.2G)->2547.0M(23.8G)]
 [Times: user=0.00 sys=0.00, real=1.76 secs] 
2018-03-09T14:55:26.647+0530: 45963.175: [GC pause (GCLocker Initiated GC) (young) (initial-mark), 0.1116088 secs]
   [Parallel Time: 97.9 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 45963178.4, Avg: 45963178.6, Max: 45963178.8, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 5.7, Avg: 6.6, Max: 12.7, Diff: 7.0, Sum: 104.9]
      [Update RS (ms): Min: 40.7, Avg: 47.1, Max: 47.7, Diff: 7.0, Sum: 753.2]
         [Processed Buffers: Min: 181, Avg: 226.9, Max: 310, Diff: 129, Sum: 3631]
      [Scan RS (ms): Min: 3.8, Avg: 4.1, Max: 4.2, Diff: 0.3, Sum: 64.9]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 38.2, Avg: 38.8, Max: 39.3, Diff: 1.1, Sum: 620.9]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 16]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.3]
      [GC Worker Total (ms): Min: 96.3, Avg: 96.6, Max: 96.8, Diff: 0.4, Sum: 1545.4]
      [GC Worker End (ms): Min: 45963275.1, Avg: 45963275.2, Max: 45963275.3, Diff: 0.2]
   [Code Root Fixup: 0.4 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 12.7 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 5.6 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.5 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 1.3 ms]
      [Free CSet: 0.6 ms]
   [Eden: 592.0M(13.7G)->0.0B(13.8G) Survivors: 592.0M->504.0M Heap: 4605.8M(23.8G)->2469.3M(23.8G)]
 [Times: user=0.00 sys=0.00, real=0.11 secs] 
2018-03-09T14:55:26.760+0530: 45963.287: [GC concurrent-root-region-scan-start]
2018-03-09T14:55:26.798+0530: 45963.325: [GC concurrent-root-region-scan-end, 0.0379580 secs]
2018-03-09T14:55:26.798+0530: 45963.325: [GC concurrent-mark-start]
2018-03-09T14:55:27.851+0530: 45964.379: [GC concurrent-mark-end, 1.0532587 secs]
2018-03-09T14:55:27.877+0530: 45964.405: [GC remark 2018-03-09T14:55:27.878+0530: 45964.405: [Finalize Marking, 0.0035673 secs] 2018-03-09T14:55:27.881+0530: 45964.409: [GC ref-proc, 0.0021035 secs] 2018-03-09T14:55:27.883+0530: 45964.411: [Unloading, 0.0291335 secs], 0.0380596 secs]
 [Times: user=0.00 sys=0.00, real=0.03 secs] 
2018-03-09T14:55:27.942+0530: 45964.470: [GC cleanup 6702M->6702M(23G), 0.0086775 secs]
 [Times: user=0.00 sys=0.00, real=0.01 secs] 
2018-03-09T14:55:30.111+0530: 45966.639: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 2.1706858 secs]
   [Parallel Time: 1692.2 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 45966642.3, Avg: 45966642.5, Max: 45966642.7, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 1.3, Avg: 3.9, Max: 12.1, Diff: 10.7, Sum: 62.8]
      [Update RS (ms): Min: 30.6, Avg: 37.9, Max: 38.5, Diff: 7.9, Sum: 605.8]
         [Processed Buffers: Min: 166, Avg: 246.8, Max: 343, Diff: 177, Sum: 3949]
      [Scan RS (ms): Min: 4.0, Avg: 4.1, Max: 4.2, Diff: 0.2, Sum: 65.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 1640.6, Avg: 1645.6, Max: 1647.7, Diff: 7.1, Sum: 26329.1]
      [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 2.4]
         [Termination Attempts: Min: 1, Avg: 4.2, Max: 6, Diff: 5, Sum: 67]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.7]
      [GC Worker Total (ms): Min: 1691.5, Avg: 1691.7, Max: 1691.8, Diff: 0.4, Sum: 27066.6]
      [GC Worker End (ms): Min: 45968334.1, Avg: 45968334.2, Max: 45968334.2, Diff: 0.1]
   [Code Root Fixup: 0.3 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.7 ms]
   [Other: 477.5 ms]
      [Evacuation Failure: 302.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 159.1 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 1.1 ms]
      [Humongous Register: 0.6 ms]
      [Humongous Reclaim: 9.6 ms]
      [Free CSet: 0.3 ms]
   [Eden: 1768.0M(13.8G)->0.0B(13.5G) Survivors: 504.0M->176.0M Heap: 14.1G(23.8G)->4432.1M(24.0G)]
 [Times: user=0.00 sys=0.00, real=2.17 secs] 
2018-03-09T14:55:32.555+0530: 45969.083: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.1336821 secs]
   [Parallel Time: 105.7 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 45969086.4, Avg: 45969086.6, Max: 45969086.7, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 5.8, Avg: 6.6, Max: 13.2, Diff: 7.4, Sum: 105.1]
      [Update RS (ms): Min: 72.0, Avg: 78.6, Max: 79.4, Diff: 7.4, Sum: 1258.2]
         [Processed Buffers: Min: 243, Avg: 337.2, Max: 502, Diff: 259, Sum: 5396]
      [Scan RS (ms): Min: 0.4, Avg: 0.7, Max: 0.8, Diff: 0.5, Sum: 11.8]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 18.0, Avg: 18.4, Max: 18.6, Diff: 0.6, Sum: 294.7]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
         [Termination Attempts: Min: 1, Avg: 1.1, Max: 2, Diff: 1, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 1.0]
      [GC Worker Total (ms): Min: 104.3, Avg: 104.4, Max: 104.6, Diff: 0.3, Sum: 1670.9]
      [GC Worker End (ms): Min: 45969191.0, Avg: 45969191.0, Max: 45969191.1, Diff: 0.1]
   [Code Root Fixup: 0.6 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 26.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 19.6 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.8 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.4 ms]
      [Free CSet: 0.2 ms]
   [Eden: 392.0M(13.5G)->0.0B(13.3G) Survivors: 176.0M->224.0M Heap: 5148.1M(24.0G)->4486.3M(24.0G)]
 [Times: user=0.00 sys=0.00, real=0.14 secs] 
2018-03-09T14:55:32.689+0530: 45969.217: [GC concurrent-root-region-scan-start]
2018-03-09T14:55:32.751+0530: 45969.279: [GC concurrent-root-region-scan-end, 0.0620086 secs]
2018-03-09T14:55:32.751+0530: 45969.279: [GC concurrent-mark-start]
2018-03-09T14:55:34.004+0530: 45970.531: [GC concurrent-mark-end, 1.2523056 secs]
2018-03-09T14:55:34.031+0530: 45970.559: [GC remark 2018-03-09T14:55:34.031+0530: 45970.559: [Finalize Marking, 0.0036417 secs] 2018-03-09T14:55:34.035+0530: 45970.563: [GC ref-proc, 0.0171440 secs] 2018-03-09T14:55:34.052+0530: 45970.580: [Unloading, 0.0319188 secs], 0.0563588 secs]
 [Times: user=0.40 sys=0.00, real=0.06 secs] 
2018-03-09T14:55:34.115+0530: 45970.642: [GC cleanup 8540M->8540M(24G), 0.0092065 secs]
 [Times: user=0.12 sys=0.00, real=0.01 secs] 
2018-03-09T14:55:35.889+0530: 45972.417: [GC pause (G1 Humongous Allocation) (young) (to-space exhausted), 1.5352217 secs]
   [Parallel Time: 1216.7 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 45972420.7, Avg: 45972420.9, Max: 45972421.1, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 1.7, Avg: 3.2, Max: 10.8, Diff: 9.0, Sum: 51.4]
      [Update RS (ms): Min: 44.0, Avg: 45.0, Max: 45.5, Diff: 1.4, Sum: 719.4]
         [Processed Buffers: Min: 209, Avg: 295.3, Max: 401, Diff: 192, Sum: 4725]
      [Scan RS (ms): Min: 7.5, Avg: 7.9, Max: 8.0, Diff: 0.5, Sum: 125.9]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 1153.4, Avg: 1160.0, Max: 1161.4, Diff: 8.0, Sum: 18559.9]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 2.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 16]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.8]
      [GC Worker Total (ms): Min: 1216.0, Avg: 1216.2, Max: 1216.4, Diff: 0.4, Sum: 19459.4]
      [GC Worker End (ms): Min: 45973637.0, Avg: 45973637.1, Max: 45973637.1, Diff: 0.1]
   [Code Root Fixup: 0.3 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 317.3 ms]
      [Evacuation Failure: 302.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 1.3 ms]
      [Humongous Register: 0.5 ms]
      [Humongous Reclaim: 7.7 ms]
      [Free CSet: 0.4 ms]
   [Eden: 2504.0M(13.3G)->0.0B(1224.0M) Survivors: 224.0M->0.0B Heap: 15.5G(24.0G)->6956.9M(24.0G)]
 [Times: user=5.67 sys=0.11, real=1.53 secs] 
2018-03-09T14:55:38.964+0530: 45975.492: [GC pause (G1 Evacuation Pause) (mixed), 0.1187935 secs]
   [Parallel Time: 95.2 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 45975495.3, Avg: 45975495.5, Max: 45975495.8, Diff: 0.5]
      [Ext Root Scanning (ms): Min: 4.3, Avg: 5.7, Max: 13.8, Diff: 9.5, Sum: 91.2]
      [Update RS (ms): Min: 64.9, Avg: 73.0, Max: 74.3, Diff: 9.4, Sum: 1167.3]
         [Processed Buffers: Min: 268, Avg: 343.9, Max: 491, Diff: 223, Sum: 5503]
      [Scan RS (ms): Min: 1.7, Avg: 2.0, Max: 2.1, Diff: 0.4, Sum: 32.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
:

VM parameters are:

-XX:+PrintGCDateStamps
-XX:+UseG1GC
-XX:MaxJavaStackTraceDepth=1000000
-XX:MaxGCPauseMillis=500
-Xmx24576m
-Xloggc:/tmp/gc/gc-%t.log
-XX:+PrintGCTimeStamps
-XX:G1ReservePercent=20
-XX:+PrintGCDetails
-XX:MaxPermSize=768m
-XX:+HeapDumpOnOutOfMemoryError
-XX:InitiatingHeapOccupancyPercent=20
-Xms10480m

Solution

  • There are lots of things going on at the same time. Due to InitiatingHeapOccupancyPercent=20 it seems to be starting concurrent marking cycles continously.

    Multiple entries of GC pause (G1 Humongous Allocation) indicate that you're contiously allocating very large objects and there is not enough contiguous free space to accommodate them, presumably due to the previously allocated ones not having been collected yet or due to heap fragmentation, again caused by the huge allocations.

    (to-space exhausted)

    At this point this is just the icing on the cake, the collector can't find enough room to move objects and has to prematurely tenure regions in place.

    You should run your application under an allocation profiler and optimize your code reduce the bursts of huge, short-lived objects.