Apache Storm address already in use error

From time to time I get the following error, repeating in a loop:

2022-11-07 06:34:30.977 o.a.s.m.n.Server main [INFO] Create Netty Server Netty-server-localhost-6704, buffer_size: 5242880, maxWorkers: 1
2022-11-07 06:34:31.566 o.a.s.u.Utils main [ERROR] Received error in thread main.. terminating worker...
java.lang.Error: java.security.PrivilegedActionException: java.net.BindException: Address already in use
    at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:663) ~[storm-client-2.4.0.jar:2.4.0]
    at org.apache.storm.utils.Utils.handleWorkerUncaughtException(Utils.java:671) ~[storm-client-2.4.0.jar:2.4.0]
    at org.apache.storm.utils.Utils.lambda$createWorkerUncaughtExceptionHandler$3(Utils.java:1058) ~[storm-client-2.4.0.jar:2.4.0]
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1055) [?:?]
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1050) [?:?]
    at java.lang.Thread.dispatchUncaughtException(Thread.java:2002) [?:?]

Storm is trying to set up a new topology but it can't.

I have been running into this for a long time, always after deploying to Storm. I once fixed it by adding a long pause (about 300 seconds) between killing the old topologies and submitting the new ones, and by increasing the workerShutdownSleepSecs setting (supervisor.worker.shutdown.sleep.secs). That gave Storm enough time to delete all blobs: I had noticed in the logs that Storm keeps cleaning up for a while even after a topology has been completely removed.
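Instead of a fixed pause, a deploy script could poll the supervisor slot ports until they are actually free. The following is only a minimal sketch under my own assumptions: the class and method names are hypothetical, it has to run on the supervisor machine, and it only makes sense when every topology on that supervisor has been killed, because any running worker keeps its slot port bound.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Storm.
final class SlotPortProbe {

    /** Returns true if a listening socket can be bound to the port right now. */
    static boolean isPortFree(int port) {
        try (ServerSocket ignored = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Covers java.net.BindException ("Address already in use").
            return false;
        }
    }

    /** Polls once per second until every port is free or the timeout elapses. */
    static boolean waitUntilFree(List<Integer> ports, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (ports.stream().allMatch(SlotPortProbe::isPortFree)) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(1000);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Ports taken from supervisor.slots.ports in the storm.yaml of this post.
        List<Integer> slotPorts = Arrays.asList(6700, 6701, 6702, 6703, 6704);
        boolean free = waitUntilFree(slotPorts, 300_000L);
        System.out.println(free ? "all slot ports are free" : "some slot port is still bound");
    }
}
```

If the probe still reports a bound port after the timeout, that points to a leftover worker process rather than to slow cleanup.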

Now it has happened again, with all of the measures above in place, this time in a smaller QA environment. Does anybody know what else can cause it?

On the supervisor machine I also checked the Storm data folder. The "/storm/workers" folder still contains some old folders with an ID in their names, so I assume Storm again failed to clean up after the old topologies.
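To make that check repeatable, the leftover folders can be listed programmatically. This is a rough sketch, assuming the workers directory contains one subfolder per worker ID and that you already know which worker IDs are alive (for example from the "Created Worker ID" lines in supervisor.log). The class name and the command-line arguments are my own invention.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Storm.
final class StaleWorkerDirs {

    /** Returns the names of subfolders of workersDir that are not in liveWorkerIds, sorted. */
    static List<String> findStale(Path workersDir, Set<String> liveWorkerIds) throws IOException {
        try (Stream<Path> entries = Files.list(workersDir)) {
            return entries.filter(Files::isDirectory)
                    .map(path -> path.getFileName().toString())
                    .filter(name -> !liveWorkerIds.contains(name))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // First argument: workers directory; remaining arguments: worker IDs known to be alive.
        Path workersDir = Paths.get(args.length > 0 ? args[0] : "/data/ansible/storm/workers");
        Set<String> live = new HashSet<>(Arrays.asList(args).subList(Math.min(1, args.length), args.length));
        findStale(workersDir, live).forEach(System.out::println);
    }
}
```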

I would expect this to be a common problem, because in my experience Storm fails at almost everything it tries to do on its own, so I guess somebody has faced it already.

My storm.yaml (the storm.scheduler: "org.apache.storm.scheduler.resource.ResourceAwareScheduler" setting is there only for testing; I don't think it affects anything):

storm.zookeeper.servers:
  - storm-nimbus-cloud-qa1
  - storm-nimbus-cloud-qa2
  - storm-nimbus-cloud-qa3

nimbus.seeds: ["storm-nimbus-cloud-qa1", "storm-nimbus-cloud-qa2", "storm-nimbus-cloud-qa3"]
storm.local.dir: /data/ansible/storm
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
  - 6704

storm.log.dir: "/data/ansible/storm_logging"

nimbus.childopts: "-Xmx512m -Djava.net.preferIPv4Stack=true"

ui.childopts: "-Xmx512m -Djava.net.preferIPv4Stack=true"
ui.port: 8080

supervisor.childopts: "-Xmx512m -Djava.net.preferIPv4Stack=true"
supervisor.cpu.capacity: 200.0
supervisor.memory.capacity.mb: 3072.0

worker.childopts: "-Djava.net.preferIPv4Stack=true"
worker.heap.memory.mb: 512

topology.component.cpu.pcore.percent: 5.0

blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: false
nimbus.topology.blobstore.deletion.delay.ms: 120000
supervisor.worker.shutdown.sleep.secs: 60
scheduler.display.resource: true
storm.scheduler: "org.apache.storm.scheduler.resource.ResourceAwareScheduler"

logviewer.cleanup.interval.secs: 3600
logviewer.max.per.worker.logs.size.mb: 512
logviewer.max.sum.worker.logs.size.mb: 2560
logviewer.cleanup.age.mins: 20160

storm.messaging.netty.max_retries: 300
storm.messaging.netty.max_wait_ms: 10000
storm.messaging.netty.min_wait_ms: 1000

I also checked the supervisor and worker logs, and this is all I found. All lines in supervisor.log related to the topology:

    Line  6493: 2022-11-04 11:09:55.880 o.a.s.d.s.BasicContainer SLOT_6704 [INFO] Created Worker ID 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6494: 2022-11-04 11:09:55.880 o.a.s.d.s.Container SLOT_6704 [INFO] Setting up 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27:7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6495: 2022-11-04 11:09:55.881 o.a.s.d.s.Container SLOT_6704 [INFO] GET worker-user for 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6496: 2022-11-04 11:09:55.882 o.a.s.d.s.Container SLOT_6704 [INFO] SET worker-user 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0 stormadmin
    Line  6497: 2022-11-04 11:09:55.882 o.a.s.d.s.Container SLOT_6704 [INFO] Creating symlinks for worker-id: 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0 storm-id: EventHandler-17-1667560186 for files(1): [resources]
    Line  6498: 2022-11-04 11:09:55.882 o.a.s.d.s.BasicContainer SLOT_6704 [INFO] Launching worker with assignment LocalAssignment(topology_id:EventHandler-17-1667560186, executors:[ExecutorInfo(task_start:4, task_end:4)], resources:WorkerResources(mem_on_heap:128.0, mem_off_heap:0.0, cpu:5.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=128.0, cpu.pcore.percent=5.0}, shared_resources:{}), owner:stormadmin) for this supervisor 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27 on port 6704 with id 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6499: 2022-11-04 11:09:55.883 o.a.s.d.s.BasicContainer SLOT_6704 [INFO] Launching worker with command: 'java' '-cp' '/usr/local/apache-storm-2.4.0/lib-worker/*:/usr/local/apache-storm-2.4.0/extlib/*:/opt/storm/conf:/data/ansible/storm/supervisor/stormdist/EventHandler-17-1667560186/stormjar.jar' '-Xmx64m' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/usr/local/apache-storm-2.4.0' '-Dworkers.artifacts=/data/ansible/storm_logging/workers-artifacts' '-Dstorm.id=EventHandler-17-1667560186' '-Dworker.id=7e1e50ed-0fba-4d8b-8c62-301edfaf32b0' '-Dworker.port=6704' '-Dstorm.log.dir=/data/ansible/storm_logging' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=/data/ansible/storm' '-Dworker.memory_limit_mb=128' '-Dlog4j.configurationFile=/usr/local/apache-storm-2.4.0/log4j2/worker.xml' 'org.apache.storm.LogWriter' 'java' '-server' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/usr/local/apache-storm-2.4.0' '-Dworkers.artifacts=/data/ansible/storm_logging/workers-artifacts' '-Dstorm.id=EventHandler-17-1667560186' '-Dworker.id=7e1e50ed-0fba-4d8b-8c62-301edfaf32b0' '-Dworker.port=6704' '-Dstorm.log.dir=/data/ansible/storm_logging' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=/data/ansible/storm' '-Dworker.memory_limit_mb=128' '-Dlog4j.configurationFile=/usr/local/apache-storm-2.4.0/log4j2/worker.xml,topology_logger.xml' '-Djava.net.preferIPv4Stack=true' '-javaagent:/opt/storm/agent/dd-java-agent.jar' '-Ddd.env=qa' '-Ddd.service=EventHandler' '-Djava.net.preferIPv4Stack=true' '-Ddd.logs.injection=true' '-Djava.library.path=/data/ansible/storm/supervisor/stormdist/EventHandler-17-1667560186/resources/Linux-amd64:/data/ansible/storm/supervisor/stormdist/EventHandler-17-1667560186/resources:/usr/local/lib:/opt/local/lib:/usr/lib:/usr/lib64' '-Dstorm.conf.file=' '-Dstorm.options=' 
'-Djava.io.tmpdir=/data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32 ...
    Line  6503: 2022-11-04 11:09:55.899 o.a.s.d.s.Slot SLOT_6704 [INFO] STATE waiting-for-blob-localization msInState: 42 -> waiting-for-worker-start msInState: 0 topo:EventHandler-17-1667560186 worker:7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6515: 2022-11-04 11:10:18.981 o.a.s.d.s.Slot SLOT_6704 [INFO] STATE waiting-for-worker-start msInState: 23082 topo:EventHandler-17-1667560186 worker:7e1e50ed-0fba-4d8b-8c62-301edfaf32b0 -> kill-blob-update msInState: 1 topo:EventHandler-17-1667560186 worker:7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6516: 2022-11-04 11:10:18.981 o.a.s.d.s.Container SLOT_6704 [INFO] Cleaning up 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27:7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6517: 2022-11-04 11:10:18.981 o.a.s.d.s.AdvancedFSOps SLOT_6704 [INFO] Deleting path /data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0/heartbeats
    Line  6518: 2022-11-04 11:10:18.982 o.a.s.d.s.AdvancedFSOps SLOT_6704 [INFO] Deleting path /data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0/pids
    Line  6519: 2022-11-04 11:10:18.982 o.a.s.d.s.AdvancedFSOps SLOT_6704 [INFO] Deleting path /data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0/tmp
    Line  6520: 2022-11-04 11:10:18.982 o.a.s.d.s.AdvancedFSOps SLOT_6704 [INFO] Deleting path /data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6521: 2022-11-04 11:10:18.982 o.a.s.d.s.Container SLOT_6704 [INFO] REMOVE worker-user 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6522: 2022-11-04 11:10:18.982 o.a.s.d.s.AdvancedFSOps SLOT_6704 [INFO] Deleting path /data/ansible/storm/workers-users/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line  6531: 2022-11-04 11:10:18.990 o.a.s.d.s.BasicContainer SLOT_6704 [INFO] Removed Worker ID 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0

All lines in worker.log related to the topology:

    Line      5: 2022-11-04 11:10:30.922 o.a.s.d.w.Worker main [INFO] Launching worker for EventHandler-17-1667560186 on 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27:6704 with id 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0 and conf {storm.messaging.netty.min_wait_ms=1000, topology.backpressure.wait.strategy=org.apache.storm.policy.WaitStrategyProgressive, storm.resource.isolation.plugin=org.apache.storm.container.cgroup.CgroupManager, storm.zookeeper.auth.user=null, storm.messaging.netty.buffer_size=5242880, storm.exhibitor.port=8080, topology.bolt.wait.progressive.level1.count=1, pacemaker.auth.method=NONE, storm.oci.cgroup.root=/sys/fs/cgroup, ui.filter=null, worker.profiler.enabled=false, executor.metrics.frequency.secs=60, supervisor.thrift.threads=16, ui.http.creds.plugin=org.apache.storm.security.auth.DefaultHttpCredentialsPlugin, supervisor.supervisors.commands=[], supervisor.queue.size=128, logviewer.cleanup.age.mins=20160, topology.tuple.serializer=org.apache.storm.serialization.types.ListDelegateSerializer, storm.cgroup.memory.enforcement.enable=false, drpc.port=3772, supervisor.localizer.update.blob.interval.secs=30, topology.max.spout.pending=null, topology.transfer.buffer.size=1000, storm.oci.nscd.dir=/var/run/nscd, nimbus.worker.heartbeats.recovery.strategy.class=org.apache.storm.nimbus.TimeOutWorkerHeartbeatsRecoveryStrategy, worker.metrics={CGroupMemory=org.apache.storm.metrics2.cgroup.CGroupMemoryUsage, CGroupMemoryLimit=org.apache.storm.metrics2.cgroup.CGroupMemoryLimit, CGroupCpu=org.apache.storm.metrics2.cgroup.CGroupCpu, CGroupCpuGuarantee=org.apache.storm.metrics2.cgroup.CGroupCpuGuarantee, CGroupCpuGuaranteeByCfsQuota=org.apache.storm.metrics2.cgroup.CGroupCpuGuaranteeByCfsQuota, CGroupCpuStat=org.apache.storm.metrics2.cgroup.CGroupCpuStat}, logviewer.port=8000, worker.childopts=-Djava.net.preferIPv4Stack=true, topology.component.cpu.pcore.percent=5.0, 
storm.daemon.metrics.reporter.plugins=[org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter], worker.max.timeout.secs=600, blac ...
    Line     16: 2022-11-04 11:10:34.608 o.a.s.s.o.a.z.ZooKeeper main [INFO] Client environment:java.io.tmpdir=/data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0/tmp
    Line     23: 2022-11-04 11:10:34.664 o.a.s.s.o.a.z.ZooKeeper main [INFO] Client environment:user.dir=/data/ansible/storm/workers/7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line     71: 2022-11-04 11:10:50.097 o.a.s.d.w.WorkerState Netty-server-localhost-6704-worker-1 [INFO] Sending BackPressure status to new client. BPStatus: {worker=7e1e50ed-0fba-4d8b-8c62-301edfaf32b0, bpStatusId=1, bpTasks=[], nonBpTasks=[4]}
    Line     73: 2022-11-04 11:10:51.622 o.a.s.d.w.WorkerState refresh-active-timer [INFO] All connections are ready for worker 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27:6704 with id 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0
    Line     83: 2022-11-04 11:10:53.841 o.a.s.d.w.Worker main [INFO] Worker 7e1e50ed-0fba-4d8b-8c62-301edfaf32b0 for storm EventHandler-17-1667560186 on 6a061042-8ce3-4b65-ab1b-46fd67a63093-172.23.16.27:6704  has finished loading

So the supervisor waited about 23 seconds (from 11:09:55 to 11:10:18) before deciding to kill the worker. Keep in mind that this environment is slow, so bringing a topology up may take a while. The worker, however, only finished loading the topology at 11:10:53, after the supervisor had already cleaned it up. That may be why nothing was killed from /storm/workers and why these orphaned workers keep sending heartbeats and so on. As I understand it, I need some setting that makes Storm wait about a minute before checking the topology state?
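The timeline can be checked directly from the timestamps in the two log excerpts. A small sketch (the class name is mine; the timestamps are copied from the log lines above):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for comparing Storm log timestamps.
final class LogTimeline {

    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    /** Milliseconds elapsed between two log timestamps. */
    static long millisBetween(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, FMT), LocalDateTime.parse(to, FMT)).toMillis();
    }

    public static void main(String[] args) {
        String waitingForStart = "2022-11-04 11:09:55.899"; // supervisor.log, line 6503
        String killBlobUpdate = "2022-11-04 11:10:18.981";  // supervisor.log, line 6515
        String workerRemoved = "2022-11-04 11:10:18.990";   // supervisor.log, line 6531
        String workerFirstLog = "2022-11-04 11:10:30.922";  // worker.log, line 5
        String workerLoaded = "2022-11-04 11:10:53.841";    // worker.log, line 83

        // Time the slot spent in waiting-for-worker-start; matches msInState: 23082.
        System.out.println(millisBetween(waitingForStart, killBlobUpdate)); // 23082
        // The worker wrote its first log line this long after the supervisor removed it.
        System.out.println(millisBetween(workerRemoved, workerFirstLog));   // 11932
        // The worker finished loading this long after the supervisor removed it.
        System.out.println(millisBetween(workerRemoved, workerLoaded));     // 34851
    }
}
```

So the worker produced its first log line almost 12 seconds after the supervisor had already logged "Removed Worker ID" for it.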

UPD: I hit the same issue again, this time in the prod environment, so it does not seem to depend on how fast the machines are. The logs are exactly the same, only the timing differs: the supervisor launched a worker at 02:15:45, and just 2 seconds later the state changed ("waiting-for-worker-start msInState: 2002 -> kill-blob-update msInState: 0"). But why? I looked through the code and found that the handleWaitingForWorkerStart method of the Slot class contains the following:

dynamicState = filterChangingBlobsFor(dynamicState, dynamicState.currentAssignment);
if (!dynamicState.changingBlobs.isEmpty()) {
    //Kill the container and restart it
    return killContainerFor(KillReason.BLOB_CHANGED, dynamicState, staticState);
}

I don't know exactly what the changingBlobs collection represents (perhaps blobs that are currently being changed), but it looks like this is where the state changed and the worker was marked to be killed. I think the delay differs because earlier in that method the supervisor waits for a heartbeat, which can take more or less time on different machines. So why did the supervisor change the state, and why did the worker start launching after the supervisor had already deleted everything related to it?

Solution

I found a workaround. It resolves the symptom, but not the underlying cause on the Storm side. What we know is this: sometimes the supervisor asks a worker to bring up a topology, waits for a while, and then decides to remove the worker because it has not responded. After that, the worker starts the topology anyway. In the end Storm does not know about the worker listening on that port and considers the port free, but the process still exists and later causes "Address already in use" exceptions in a loop.

I always fixed this manually by killing the process holding the port, after which everything worked again. So if we are 100% sure that the process belongs to Storm, it would be good to have a mechanism that kills it automatically. I cloned the original Storm git repository and looked for an easy way to do that. The changes are in the storm-client project, in the Worker and Utils classes. Added to the Utils class:

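// Needs java.io.BufferedReader, java.io.InputStreamReader and java.net.BindException to be imported in Utils.
public static UncaughtExceptionHandler createWorkerUncaughtExceptionHandler(String port) {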
    return (thread, thrown) -> {
        try {
            try {
                String message = thrown.getMessage();
                Throwable cause = thrown.getCause();
                if (thrown instanceof BindException || cause instanceof BindException
                        || (message != null && message.contains("BindException"))) {
                    Process process = new ProcessBuilder().command("lsof", "-t",
                            String.format("-i:%s", port)).start();
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(process.getInputStream()))) {
                        String pid;
                        if ((pid = reader.readLine()) != null) {
                            new ProcessBuilder().command("kill", "-9", pid).start();
                            LOG.error("killed pid " + pid);
                        }
                    }
                    LOG.error(String.format(
                            "Received BindException error on %s port, process was closed on this port", port));
                }
            } catch (Exception e) {
                LOG.error(String.format(
                        "Received BindException error on %s port, process was not closed on this port", port), e);
            }
            handleWorkerUncaughtException(thrown);
        } catch (Error err) {
            LOG.error("Received error in thread {}.. port " + port + ".. terminating worker...", thread.getName(),
                    err);
            Runtime.getRuntime().exit(-2);
        }
    };
}
public static void setupWorkerUncaughtExceptionHandler(String port) {
    Thread.setDefaultUncaughtExceptionHandler(createWorkerUncaughtExceptionHandler(port));
}

Then, in the main method of the Worker class, call the new setupWorkerUncaughtExceptionHandler with the port:

public static void main(String[] args) throws Exception {
    Preconditions.checkArgument(args.length == 5, "Illegal number of arguments. Expected: 5, Actual: " + args.length);
    String stormId = args[0];
    String assignmentId = args[1];
    String supervisorPort = args[2];
    String portStr = args[3];
    String workerId = args[4];
    Map<String, Object> conf = ConfigUtils.readStormConfig();
    //Changes
    Utils.setupWorkerUncaughtExceptionHandler(portStr);
    //Changes
    StormCommon.validateDistributedMode(conf);
    int supervisorPortInt = Integer.parseInt(supervisorPort);
    Worker worker = new Worker(conf, null, stormId, assignmentId, supervisorPortInt, Integer.parseInt(portStr), workerId);

    //Add shutdown hooks before starting any other threads to avoid possible race condition
    //between invoking shutdown hooks and registering shutdown hooks. See STORM-3658.
    int workerShutdownSleepSecs = ObjectReader.getInt(conf.get(Config.SUPERVISOR_WORKER_SHUTDOWN_SLEEP_SECS));
    LOG.info("Adding shutdown hook with kill in {} secs", workerShutdownSleepSecs);
    Utils.addShutdownHookWithDelayedForceKill(worker::shutdown, workerShutdownSleepSecs);

    worker.start();
}

So whenever the worker hits the "Address already in use" exception, it kills the process that holds the port. It is not the best solution, but it is quick and it works. After that you only need to build the library and replace it in the Storm folder on all nodes. Make sure, however, that nothing other than Storm can be listening on the Storm ports. It is also better to run Storm under its own Linux user, so that it cannot kill any process other than the ones it started itself. The library is built against Storm 2.4.0. My build: https://gitlab.com/nikita_poddubskiy/storm-address-already-in-use I also asked about this on the mailing list, but there has been no response so far: https://lists.apache.org/list?user@storm.apache.org:2022-12
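As an extra safety net on top of the dedicated Linux user, the PID returned by lsof could be checked before it is killed. The sketch below is not part of my build; it is only an idea, assuming Java 9 or newer (for ProcessHandle) and assuming the workers run as the stormadmin user seen in the supervisor log. The class and method names are hypothetical.

```java
// Hypothetical guard, not part of Storm or of the patched library.
final class PidOwnerCheck {

    /**
     * Returns true only if the process exists and the OS reports expectedUser as its owner.
     * If the owner cannot be determined, this returns false so the caller skips the kill.
     */
    static boolean isOwnedBy(long pid, String expectedUser) {
        return ProcessHandle.of(pid)
                .flatMap(handle -> handle.info().user())
                .map(expectedUser::equals)
                .orElse(false);
    }

    public static void main(String[] args) {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        String stormUser = args.length > 1 ? args[1] : "stormadmin";
        System.out.println(isOwnedBy(pid, stormUser)
                ? "owned by " + stormUser + ", safe to kill"
                : "not owned by " + stormUser + ", skipping");
    }
}
```

In the handler above, such a check would go between reading the PID from lsof and running kill.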