
Storm supervisor on Windows exits with an error when a topology is killed


Storm version: 1.2.2
Platform: Windows Server 2008

I have a Storm cluster made up of one Linux server and one Windows server. Both servers run the nimbus and supervisor services. I started a topology and then killed it. The supervisor process on the Windows server exited with an error, while the worker process on the Windows server stayed alive.

The following error is shown:

"error: cannot kill the process with pid xxx; this process can only be terminated forcefully (use the /F option)."

The error message above is my translation of a screenshot (error-info-pic).

I have no idea what causes this error. I searched Google for answers but found nothing, so I am asking here. I hope you can help me.

Updated on 2018/12/24

I found that the worker starts one topology process. When the supervisor tries to kill them, killing the topology process fails first, and then killing the worker fails as well.

I compiled a new storm-core.jar with more detailed logging at the point where the supervisor kills the worker. The detailed error log is as follows:

    org.apache.storm.shade.org.apache.commons.exec.ExecuteException: Process exited with an error: 128 (Exit value: 128)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:377) ~[storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160) ~[storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147) ~[storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.utils.Utils.execCommand(Utils.java:1914) ~[storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.utils.Utils.sendSignalToProcess(Utils.java:1943) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.utils.Utils.killProcessWithSigTerm(Utils.java:1962) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Container.kill(Container.java:166) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Container.kill(Container.java:184) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Slot.killContainerForChangedAssignment(Slot.java:311) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Slot.handleRunning(Slot.java:527) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Slot.stateMachineStep(Slot.java:265) [storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.Slot.run(Slot.java:752) [storm-core-1.2.2.jar:1.2.2]


Solution

  • I compiled a new storm-core.jar. In the org.apache.storm.utils.Utils::sendSignalToProcess function, I added a log message and changed the Windows branch, as follows:

    public static void sendSignalToProcess(long lpid, int signum) throws IOException {
        String pid = Long.toString(lpid);
        try {
            // Added: log the signal number the supervisor is sending.
            LOG.info("Added: {}.", signum);
            if (isOnWindows()) {
                // Changed: on Windows, treat SIGTERM the same way as SIGKILL.
                if (signum == SIGKILL || signum == SIGTERM) {
                    // Changed: /F forces termination; /T also terminates child processes.
                    execCommand("taskkill", "/F", "/T", "/pid", pid);
                } else {
                    execCommand("taskkill", "/pid", pid);
                }
            } else {
                execCommand("kill", "-" + signum, pid);
            }
        } catch (ExecuteException e) {
            // A non-zero exit code from the kill command ends up here.
            LOG.info("Error when trying to kill {}. Process is probably already dead.", pid);
        } catch (IOException e) {
            LOG.info("IOException Error when trying to kill {}.", pid);
            throw e;
        }
    }
    

    I found that the supervisor sends signal 15 (SIGTERM) to the worker when Storm kills a topology, but on Windows the supervisor cannot kill the worker with signal 15. It must use signal 9 (SIGKILL) to force the kill. So I decided to use the newly compiled storm-core.jar on the Windows servers.

    I still do not know why the supervisor cannot kill the worker with signal 15 and can only do so with signal 9 (that is, taskkill kills the worker only with the /F option), but this appears to be a Windows issue, so I am closing this question.
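
For readers who want to check the signal-to-command mapping without actually killing a process, here is a minimal sketch of the decision made in the patched function. The class KillCommandSketch and the method buildKillCommand are hypothetical names introduced only for illustration and are not part of Storm. The sketch only builds the argument list and never executes it.

```java
import java.util.Arrays;
import java.util.List;

public class KillCommandSketch {
    // POSIX signal numbers, as referred to in the answer above.
    public static final int SIGKILL = 9;
    public static final int SIGTERM = 15;

    /**
     * Hypothetical helper (not a Storm API): returns the command line the
     * patched logic would run for the given pid, signal and platform.
     */
    public static List<String> buildKillCommand(long pid, int signum, boolean onWindows) {
        String pidStr = Long.toString(pid);
        if (!onWindows) {
            // Non-Windows branch: pass the signal number straight to kill.
            return Arrays.asList("kill", "-" + signum, pidStr);
        }
        if (signum == SIGKILL || signum == SIGTERM) {
            // Windows: force termination (/F) of the process and its children (/T).
            return Arrays.asList("taskkill", "/F", "/T", "/pid", pidStr);
        }
        // Windows, any other signal: plain taskkill without /F.
        return Arrays.asList("taskkill", "/pid", pidStr);
    }

    public static void main(String[] args) {
        System.out.println(buildKillCommand(1234L, SIGTERM, true));  // [taskkill, /F, /T, /pid, 1234]
        System.out.println(buildKillCommand(1234L, SIGTERM, false)); // [kill, -15, 1234]
    }
}
```

Keeping the mapping in a side-effect-free method like this makes it possible to verify the Windows behavior on any machine before rebuilding storm-core.jar.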