Tags: memory-leaks, solaris, ps, pmap, libumem

Solaris : pmap reports a different virtual memory size than ps


I have a process running on Solaris (SunOS m1001 5.10 sun4v sparc) and have been monitoring the total virtual memory it uses.

Running ps periodically showed that the VSZ grows linearly over time, in jumps of 80 KB, and keeps growing until it reaches the 4 GB limit. At that point the process is out of address space and things start to fall apart.

while true; do ps -p 27435 -o pid=,vsz=,rss=; sleep 5; done > ps.txt
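
To quantify the growth in ps.txt (jump sizes and how long until the 4 GB ceiling), a small awk summary can help. This is a hypothetical helper, not something from the original post; it assumes each sample line is "pid vsz rss" with sizes in KB (as Solaris ps reports VSZ) and a fixed sampling interval, 5 seconds in the loop above.

```shell
# Hypothetical helper: summarize VSZ growth recorded in ps.txt.
# Assumes each data line is "pid vsz rss" with sizes in KB and that
# samples are taken every $2 seconds (default 5, as in the loop above).
vsz_growth_report() {
    awk -v secs="${2:-5}" '
        # Only lines whose second field is numeric are samples; this skips
        # any ps header lines.
        NF >= 3 && $2 ~ /^[0-9]+$/ {
            n++
            if (n == 1) first = $2
            else if ($2 != prev) step[$2 - prev]++   # histogram of jump sizes (KB)
            prev = $2
        }
        END {
            if (n < 2) { print "need at least two samples"; exit 1 }
            rate = (prev - first) / ((n - 1) * secs)  # average KB per second
            printf "samples=%d first=%dKB last=%dKB growth=%dKB\n", n, first, prev, prev - first
            for (s in step) printf "jump %dKB x %d\n", s, step[s]
            limit = 4 * 1024 * 1024                   # 4 GB expressed in KB
            if (rate > 0)
                printf "hours until 4GB: %.1f\n", (limit - prev) / rate / 3600
        }' "$1"
}
```

Usage: `vsz_growth_report ps.txt 5`. A histogram dominated by a single jump size (80 KB here) points at one repeated allocation pattern rather than random growth.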

I suspected a memory leak and decided to investigate further with pmap. However, pmap shows that the total virtual size is not growing at all; it stays stable. All file mappings, shared memory mappings and the heap also keep the same size.

while true; do pmap -x 27435 |grep total; sleep 5; done > pmap.txt
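
Running two separate loops makes the samples hard to line up. A sketch that records both figures on one timestamped line, plus their difference, is below. It assumes a POSIX shell (for example ksh on Solaris 10, whose legacy /bin/sh lacks `$(...)`), that `ps -o vsz=` suppresses the header, and that the last line of `pmap -x` has the form "total Kb <Kbytes> <RSS> <Anon> <Locked>". The function names are hypothetical.

```shell
# Sketch, assuming a POSIX shell and the Solaris `pmap -x` layout whose
# last line reads "total Kb <Kbytes> <RSS> <Anon> <Locked>".

# Extract the Kbytes column of the "total Kb" line from pmap -x output on stdin.
pmap_total_kb() {
    awk '$1 == "total" && $2 == "Kb" { print $3 }'
}

# VSZ in KB as reported by ps for one PID (the trailing "=" suppresses the header).
ps_vsz_kb() {
    ps -p "$1" -o vsz= | tr -d ' '
}

# Print one line: "<timestamp> <ps VSZ KB> <pmap total KB> <difference KB>".
sample_vsz() {
    ps_kb=$(ps_vsz_kb "$1")
    pmap_kb=$(pmap -x "$1" | pmap_total_kb)
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $ps_kb $pmap_kb $((ps_kb - pmap_kb))"
}
```

Usage: `while true; do sample_vsz 27435; sleep 5; done > vsz-compare.txt`. A difference column that grows while the pmap column stays flat shows the growth is in what ps counts, not in any mapping pmap can see.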

My first question is: why do ps and pmap report a different virtual memory size for the same process?

I can imagine that heap sizes are calculated differently (e.g. heap usage versus the highest heap pointer), so I started thinking in the direction of heap fragmentation. I then used libumem and mdb to produce detailed reports about allocated memory at different times and noticed that there was absolutely no difference in allocated memory.
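
Several of the mdb reports (::umalog, ::findleaks stack traces) only have data if libumem's debugging features were enabled when the process started. A minimal sketch of such a launcher, assuming the documented `UMEM_DEBUG` and `UMEM_LOGGING` environment variables and a 32-bit target (so plain `LD_PRELOAD` applies); the function name and the program path are hypothetical.

```shell
# Sketch: run a program with libumem preloaded and debugging enabled.
# UMEM_DEBUG=default is shorthand for the audit, contents and guards options;
# UMEM_LOGGING=transaction keeps the transaction log that ::umalog prints.
run_with_umem() {
    LD_PRELOAD=libumem.so UMEM_DEBUG=default UMEM_LOGGING=transaction "$@"
}
```

Usage: `run_with_umem /path/to/app &`, then attach mdb to the resulting PID. Debugging mode adds per-buffer overhead, so VSZ figures from a debug run are not directly comparable with a normal run.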

 mdb 27435 < $umem_cmds
 ::walk thread |::findstack !tee>>umem-findstack.log
 ::umalog !tee>>umem-umalog.log
 ::umastat !tee>>umem-umastat.log
 ::umausers !tee>>umem-umausers.log
 ::umem_cache !tee>>umem-umem_cache.log
 ::umem_log !tee>>umem-umem_log.log
 ::umem_status !tee>>umem-umem_status.log
 ::umem_malloc_dist !tee>>umem-umem_malloc_dist.log
 ::umem_malloc_info !tee>>umem-umem_malloc_info.log
 ::umem_verify !tee>>umem-umem_verify.log
 ::findleaks -dv !tee>>umem-findleaks.log
 ::vmem !tee>>umem-vmem.log
 *umem_oversize_arena::walk vmem_alloc | ::vmem_seg -v !tee>>umem-oversize.log
 *umem_default_arena::walk vmem_alloc | ::vmem_seg -v !tee>>umem-default.log
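
Appending every run to the same log files makes it hard to compare reports taken at different times. One possible arrangement, sketched with hypothetical helper names, runs the same mdb command file inside a fresh timestamped directory per snapshot and then diffs two snapshots. It assumes the command file writes its logs to relative paths, as the tee targets above do, and that `mdb -p PID` is used to attach.

```shell
# Hypothetical helpers. umem_snapshot runs the mdb command file inside a new
# timestamped directory, so relative tee log paths land there.
umem_snapshot() {
    dir=snap-$(date '+%Y%m%d-%H%M%S')
    mkdir "$dir" || return 1
    # The redirection is opened before the subshell's cd, so a relative
    # command-file path still resolves; the cd does not leak to the caller.
    (cd "$dir" && mdb -p "$1") < "$2" && echo "$dir"
}

# Exit status 0 if two snapshot directories are identical; otherwise print
# a recursive diff and return non-zero.
compare_snapshots() {
    diff -r "$1" "$2"
}
```

Usage: `a=$(umem_snapshot 27435 umem_cmds)`, wait a while, `b=$(umem_snapshot 27435 umem_cmds)`, then `compare_snapshots "$a" "$b"`. Reports containing addresses or timestamps will always differ; the ::umastat and ::umem_malloc_info files are the more meaningful ones to compare.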

So my second question is: what is the best way to figure out what is causing the growing VSZ reported by ps?
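
One generic way to narrow down where growth shows up (not the method the thread eventually used) is to save `pmap -x` output at two points in time and list only the mappings that changed, appeared or disappeared. The sketch assumes the Solaris layout where column 1 is the hex start address and column 2 is the size in Kbytes; the function name is hypothetical. If nothing is listed while ps VSZ keeps climbing, the growth is not attributable to any mapping pmap reports.

```shell
# Hypothetical helper: compare two saved `pmap -x` outputs keyed by address.
# Data rows are those whose first field is a bare hex address; header,
# separator and "total Kb" lines are skipped by that test.
changed_mappings() {
    awk '
        $1 ~ /^[0-9A-Fa-f]+$/ && NF >= 2 {
            if (FILENAME == ARGV[1]) old[$1] = $2
            else new[$1] = $2
        }
        END {
            for (a in new) {
                if (!(a in old)) print a, "added", new[a]
                else if (old[a] != new[a]) print a, old[a], "->", new[a]
            }
            for (a in old)
                if (!(a in new)) print a, "removed", old[a]
        }' "$1" "$2"
}
```

Usage: `pmap -x 27435 > t1.txt`, wait, `pmap -x 27435 > t2.txt`, then `changed_mappings t1.txt t2.txt`.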


Solution

  • I noticed that this question was still open and wanted to share how the story ended.

    After a lot more digging, I contacted Solaris customer support and sent them a way to reproduce the problem. They confirmed that a bug in the kernel was causing this behavior.

    Unfortunately, I cannot confirm whether they rolled out a patch, since I have since left the company I was working for at the time.

    Thx, Jef