Not sure if this would be better suited for ServerFault, but since I am not an admin but a developer I figured I would try SO.
We've been struggling to keep our multi-server configuration stable for quite some time now. At the end of last month we were running CF 7.0.2 on a two-server setup (one instance each). At that point we had managed to get our uptime to around 1 week per instance before they would restart by themselves. Since upgrading to CF 9 at the beginning of this month, we're back to square one with multiple restarts a day.
Our current configuration is 2 Win2k3 servers, running a cluster of 4 instances, 2 instances per server. At this point we are pretty certain this is due to improper JVM settings.
We've been toying with them, and while some combinations are more stable than others, we never quite got it right.
From the default:
java.args=-server -Xmx512m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=192m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/
To currently:
java.args=-server -Xmx896m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance1b.log
We determined that we need more than the default 512MB simply by monitoring with FusionReactor: on average our memory consumption hovers in the mid-300MB range and can climb to the low 700s under heavy load.
Most of the crashes are logged in jrun4/bin/hs_err_pid*.log, always as an "Out of swap space" error.
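To back up the "always" with a count, here is a minimal sketch that tallies how many hs_err files mention that message. The function name and the directory argument are my own placeholders, and it does nothing smarter than a substring search, so treat it as a starting point rather than a real log parser.

```python
import glob
import os


def count_swap_space_crashes(log_dir, needle="Out of swap space"):
    """Return (matching, total) for the hs_err_pid*.log files in log_dir.

    Both the directory and the search string are parameters; nothing
    here depends on the internal layout of an hs_err file.
    """
    paths = sorted(glob.glob(os.path.join(log_dir, "hs_err_pid*.log")))
    matching = 0
    for path in paths:
        # errors="replace" keeps a stray byte from aborting the scan
        with open(path, "r", errors="replace") as fh:
            if needle in fh.read():
                matching += 1
    return matching, len(paths)
```

Calling count_swap_space_crashes("c:/Jrun4/bin") would then return something like (crashes with that message, total crash logs), assuming the logs live in that directory.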
I've attached links to the hs_err and garbage collector log file from yesterday at the bottom of the post.
The relevant part is (I think) this:
Heap
 PSYoungGen      total 89856K, used 19025K [0x55490000, 0x5b6f0000, 0x5b810000)
  eden space 79232K, 16% used [0x55490000,0x561a64c0,0x5a1f0000)
  from space 10624K, 52% used [0x5ac90000,0x5b20e2f8,0x5b6f0000)
  to   space 10752K, 0% used [0x5a1f0000,0x5a1f0000,0x5ac70000)
 PSOldGen        total 460416K, used 308422K [0x23810000, 0x3f9b0000, 0x55490000)
  object space 460416K, 66% used [0x23810000,0x36541bb8,0x3f9b0000)
 PSPermGen       total 107520K, used 106079K [0x03810000, 0x0a110000, 0x23810000)
  object space 107520K, 98% used [0x03810000,0x09fa7e40,0x0a110000)
From it, I gather that it's the PSPermGen that is full (most logs show the same before a crash), which is why we increased MaxPermSize, but the total still shows as 107520K!?
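One thing that may explain the 107520K figure: as far as I can tell, the three addresses in brackets on each generation line are the start of the region, the end of the memory committed so far, and the end of the reserved range. That reading is an assumption on my part, but the arithmetic on the dump above is consistent with it. The sketch below (function and field names are mine) decodes a generation line into committed and reserved sizes.

```python
import re

# Matches a generation summary line from the "Heap" section, e.g.
# PSPermGen total 107520K, used 106079K [0x03810000, 0x0a110000, 0x23810000)
GEN_LINE = re.compile(
    r"(?P<name>PS\w+)\s+total\s+(?P<total>\d+)K,\s+used\s+(?P<used>\d+)K\s+"
    r"\[(?P<low>0x[0-9a-fA-F]+),\s*(?P<high>0x[0-9a-fA-F]+),"
    r"\s*(?P<limit>0x[0-9a-fA-F]+)\)"
)


def decode_generation(line):
    """Turn one generation line into sizes in KB.

    Assumes the bracket holds [start, committed end, reserved end).
    """
    m = GEN_LINE.search(line)
    if m is None:
        raise ValueError("not a generation summary line: %r" % line)
    low, high, limit = (int(m.group(k), 16) for k in ("low", "high", "limit"))
    return {
        "name": m.group("name"),
        "total_kb": int(m.group("total")),
        "used_kb": int(m.group("used")),
        "committed_kb": (high - low) // 1024,  # memory actually backed so far
        "reserved_kb": (limit - low) // 1024,  # ceiling the region can grow to
    }
```

Fed the PSPermGen line, this gives 107520 KB committed and 524288 KB (512 MB) reserved, so MaxPermSize=512m does appear to have taken effect as a ceiling, and "total" is only what has been committed so far. The old and young generations together reserve 917504 KB, which is exactly the 896 MB from -Xmx896m. If that reading is right, the process has about 1408 MB of address space reserved before counting thread stacks, libraries, or JRun itself.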
No one here is a JRun expert, so any help or even ideas on what to try next would be greatly appreciated!
The log files: Sorry, I know sendspace isn't the friendliest of places - if you have another host to suggest for log files, let me know and I'll update the post (SO doesn't like them inline; it blows up the format of the post).
A little update. I've tried different GCs, and while some stabilized the system for a while, it kept crashing, only less frequently. So I kept digging and eventually found out that the JVM will throw "Out of swap space" when the OS itself refuses to allocate the memory requested.
This usually happens when the maximum memory is already assigned to the JVM process. That total includes the JRun overhead, the JVM itself, all the libraries, the heap AND the thread stacks. Since each request runs on its own thread, and every thread gets its own stack, the total stack memory grows and grows when a lot of requests are being spawned. The stack size of each thread varies according to the OS and the JVM version, but it can be controlled using the -Xss argument. I reduced ours to 64k, so our java.args looks like this:
java.args=-server -Xmx768m -Xss64k -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance2a.log
So far everything has been stable without any noticeable slowdown for 6 days, which is definitely the longest I've ever seen the application stay up. If you reduce the stack size too much, you'll start noticing stack overflow errors in the log instead of the OOM error.
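To put rough numbers on the trade-off, here is a back-of-the-envelope sketch. It assumes a 32-bit Windows process with the default 2 GB of user address space, that every thread reserves exactly the -Xss amount, and a hypothetical 500 concurrent threads. The function names are mine, and the result is only an estimate of what is left over for JRun, the JVM's own code, and native libraries.

```python
def stack_reservation_mb(threads, xss_kb):
    """Address space taken by thread stacks, in MB (threads * -Xss)."""
    return threads * xss_kb / 1024.0


def native_headroom_mb(address_space_mb, heap_mb, perm_mb, threads, xss_kb):
    """What remains after the heap, PermGen, and thread stacks are reserved.

    address_space_mb is an assumption about the OS (2048 for a default
    32-bit Windows process); heap_mb and perm_mb come from -Xmx and
    -XX:MaxPermSize.
    """
    stacks = stack_reservation_mb(threads, xss_kb)
    return address_space_mb - heap_mb - perm_mb - stacks


# Settings from the java.args above, with a hypothetical 500 threads
small_stacks = native_headroom_mb(2048, 768, 512, threads=500, xss_kb=64)
# Same settings with a hypothetical 256k stack, for comparison
large_stacks = native_headroom_mb(2048, 768, 512, threads=500, xss_kb=256)
```

The difference between the two results is the address space that shrinking the stacks hands back to everything else in the process. It also suggests that lowering MaxPermSize would free up far more than the stacks did.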
My next step will be to tweak the MaxPermSize but so far so good!