redis solaris zombie-process

Zombie process in Solaris 10 even with wait


I'm working on getting Redis to run on Solaris 10, and there are a few integration tests that are failing. The test I'm looking into works like this:

  • Start Redis
  • It forks and the child starts dumping the database to a backup file (RDB)
    • There's actually a parent / child / grandchild relationship going on, where the grandchild becomes a zombie, but I only noticed that minutes before I had to head home.
  • After a short time the test script sends SIGTERM to the child
  • The child catches the signal & shuts down gracefully
  • The parent calls wait3()

In spite of the wait3() call the child ends up in a zombie state.
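For reference, the sequence above can be sketched in isolation. This is only a minimal sketch of the simple parent / child case (not the grandchild case, and not Redis's actual code): the helper name `fork_term_and_reap`, the 10 ms polling interval, and the 5 second limit are all made up for illustration. In this simple case the `wait3()` call does reap the child, which is what I expected to happen in the test.

```c
/* Assumption: exposes wait3() and usleep() on glibc under strict -std= modes;
 * the macro has no special meaning on other systems. */
#define _DEFAULT_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_term = 0;

/* Child's SIGTERM handler: only records that the signal arrived. */
static void on_term(int sig)
{
    (void)sig;
    got_term = 1;
}

/* Hypothetical helper: fork a child, send it SIGTERM, then poll wait3().
 * Returns the child's exit status, or -1 on error or timeout. */
int fork_term_and_reap(void)
{
    sigset_t block, old;
    pid_t pid;
    int i;

    /* Block SIGTERM across fork() so the child cannot be killed
     * before it has installed its handler. */
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    if (pid == 0) {
        /* Child: catch SIGTERM and shut down gracefully. */
        struct sigaction sa;
        sa.sa_handler = on_term;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGTERM, &sa, NULL);

        sigdelset(&old, SIGTERM);
        while (!got_term)
            sigsuspend(&old);   /* atomically unblock SIGTERM and wait */
        _exit(0);
    }

    /* Parent: restore the signal mask, then play the test script's role. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    kill(pid, SIGTERM);

    /* Poll without blocking, up to about 5 seconds. */
    for (i = 0; i < 500; i++) {
        int status;
        pid_t r = wait3(&status, WNOHANG, NULL);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r == -1)
            return -1;          /* e.g. ECHILD: nothing left to wait for */
        usleep(10000);
    }
    return -1;
}
```

Once `wait3()` has returned the child's pid, the process table entry is released and the child is no longer a zombie.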

The test fails around 90% of the time when I run it. Once it gets into a failed state it never recovers. I tried changing the test to wait significantly longer, and although the parent appears to call wait3() many times after the process has exited, the child stays a zombie until the parent process(es) are killed.

Unfortunately I won't be able to work on this again until next week, so I'm researching it from home. Most of my googling has only turned up documentation or "why do processes become zombies?" type questions.

This Google Groups thread from the mid-90s may help, though it mostly discusses older releases of Solaris / SunOS.


Solution

  • I was mistaken. It looks like the master node doesn't see that its child failed, so it doesn't wait.
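To illustrate that kind of failure, here is a rough sketch under assumptions. It is not Redis's actual code: the names `tracked_child`, `reap_if_tracked`, and `spawn_short_lived_child` are hypothetical. The idea is that a parent which only calls `wait3()` while its own bookkeeping says a child is running will never reap a child it has lost track of, so that child stays a zombie.

```c
/* Assumption: exposes wait3() on glibc under strict -std= modes;
 * the macro has no special meaning on other systems. */
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping: pid of the child the parent believes
 * is running, or -1 if it believes there is none. */
static pid_t tracked_child = -1;

/* Called periodically by the parent.
 * Returns 1 if a child was reaped, 0 otherwise. */
int reap_if_tracked(void)
{
    int status;
    pid_t r;

    if (tracked_child == -1)
        return 0;   /* parent thinks nothing is running, so it never waits */

    r = wait3(&status, WNOHANG, NULL);
    if (r <= 0)
        return 0;   /* no child has exited yet, or there are no children */

    if (r == tracked_child)
        tracked_child = -1;
    return 1;
}

/* Hypothetical helper for the demo: fork a child that fails immediately.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t spawn_short_lived_child(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(1);
    return pid;
}
```

If `tracked_child` is still -1 when the child exits, every call to `reap_if_tracked()` returns 0 and the zombie remains until the parent itself exits. Setting `tracked_child` to the child's pid makes the next call reap it.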