Search code examples
javamultithreading

Java 22.0.2 VirtualThread calls DatagramSocket.receive under heavy load hangs forever


When I use following threadpool with heavy load.

private final ScheduledExecutorService serverExecutor2 = Executors.newScheduledThreadPool(2000, Thread.ofVirtual().name("COAP-SERVER2-", 0).factory());

it seems to stuck: serverExecutor2=@ScheduledThreadPoolExecutor[java.util.concurrent.ScheduledThreadPoolExecutor@327e5be5[Running, pool size = 2000, active threads = 0, queued tasks = 60000, completed tasks = 97847]],

All the correspond virtual thread is parking:

#75 "COAP-SERVER2-5" virtual
      java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
      java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
      java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:369)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:519)
      java.base/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3780)
      java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3725)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1707)
      java.base/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1177)
      java.base/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
      java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1070)
      java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
      java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
      java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

And it seems to be caused by blocking of VirtualThread-unparker while it has a tons of tasks waiting:

"VirtualThread-unparker" #92 [593799] daemon prio=5 os_prio=0 cpu=2238.28ms elapsed=373.27s tid=0x00007f1b3002fd50 nid=593799 waiting on condition  [0x00007f1e6db43000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00001000177791b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:269)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:1758)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:1182)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:899)
        at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1070)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:642)
        at java.lang.Thread.runWith([email protected]/Thread.java:1583)
        at java.lang.Thread.run([email protected]/Thread.java:1570)
        at jdk.internal.misc.InnocuousThread.run([email protected]/InnocuousThread.java:186)

   Locked ownable synchronizers:
        - None

I used a debug tool to get the queue of unparker. it's obvious that task at queue top should run a long time ago (it's time is 84137649857516 while current time is 84707647860354):

[arthas@593599]$ ognl '@[email protected]()'
@String[java.util.concurrent.ScheduledThreadPoolExecutor@154b1210[Running, pool size = 1, active threads = 0, queued tasks = 30027, completed tasks = 413674]]
[arthas@593599]$ ognl '@System@nanoTime()'
@Long[84707647860354]
[arthas@593599]$ ognl '@[email protected]()'
@ScheduledFutureTask[
    sequenceNumber=@Long[1227676],
    time=**@Long[84137649857516]**,
    period=@Long[0],
    outerTask=@ScheduledFutureTask[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5d021eb6[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@7d5f12fb[Wrapped task = java.lang.VirtualThread$$Lambda/0x00007f1bf357cc70@56d5ca23]]],
    heapIndex=@Integer[0],
    this$0=@ScheduledThreadPoolExecutor[java.util.concurrent.ScheduledThreadPoolExecutor@154b1210[Running, pool size = 1, active threads = 0, queued tasks = 30027, completed tasks = 413674]],
    state=@Integer[0],
    NEW=@Integer[0],
    COMPLETING=@Integer[1],
    NORMAL=@Integer[2],
    EXCEPTIONAL=@Integer[3],
    CANCELLED=@Integer[4],
    INTERRUPTING=@Integer[5],
    INTERRUPTED=@Integer[6],
    callable=@RunnableAdapter[java.util.concurrent.Executors$RunnableAdapter@7d5f12fb[Wrapped task = java.lang.VirtualThread$$Lambda/0x00007f1bf357cc70@56d5ca23]],
    outcome=null,
    runner=null,
    waiters=null,
    STATE=@FieldInstanceReadWrite[VarHandle[varType=int, coord=[class java.util.concurrent.FutureTask]]],
    RUNNER=@FieldInstanceReadWrite[VarHandle[varType=java.lang.Thread, coord=[class java.util.concurrent.FutureTask]]],
    WAITERS=@FieldInstanceReadWrite[VarHandle[varType=java.util.concurrent.FutureTask$WaitNode, coord=[class java.util.concurrent.FutureTask]]],

Am I doing something wrong or this is a bug of jdk or os?

---- update ----

I find out when using DatagramChannel to receive packet, the triggertime of scheudleTask can be negative:

ts=2024-08-06 14:32:36;thread_name=ForkJoinPool-1-worker-1;id=97;is_daemon=true;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@5c647e05
    @java.util.concurrent.ScheduledThreadPoolExecutor.triggerTime() -- get a number close to Long.MAX_VALUE as param and output nagative number.
        at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:560)
        at java.lang.VirtualThread.scheduleUnpark(VirtualThread.java:705)
        at java.lang.VirtualThread.parkNanos(VirtualThread.java:632)
        at java.lang.System$2.parkVirtualThread(System.java:2652)
        at jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:67)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:408)
        at sun.nio.ch.Poller.pollIndirect(Poller.java:137)
        at sun.nio.ch.Poller.poll(Poller.java:102)
        at sun.nio.ch.Poller.poll(Poller.java:87)
        at sun.nio.ch.DatagramChannelImpl.park(DatagramChannelImpl.java:494)
        at sun.nio.ch.DatagramChannelImpl.tryBlockingReceive(DatagramChannelImpl.java:762)
        at sun.nio.ch.DatagramChannelImpl.blockingReceive(DatagramChannelImpl.java:692)   ---- if nanos is zero, nanos = Long.MAX_VALUE, which caused overflow at ScheduledThreadPoolExecutor.schedule 
        at sun.nio.ch.DatagramSocketAdaptor.receive(DatagramSocketAdaptor.java:204)
        at java.net.DatagramSocket.receive(DatagramSocket.java:714)

Solution

  • It is caused by long overflow of java.util.concurrent.ScheduledThreadPoolExecutor#triggerTime(long, java.util.concurrent.TimeUnit)

            Thread.ofVirtual().start(() -> {
                try {
                    DatagramSocket datagramSocket = new DatagramSocket(null);
                    datagramSocket.bind(new InetSocketAddress("0.0.0.0", 0));
                   // set a breakpoint at java.lang.VirtualThread#scheduleUnpark before next line and we will find out that trigger time of unpark task is negative which may cause java.util.concurrent.ScheduledThreadPoolExecutor.ScheduledFutureTask#getDelay overflow again to a very large number.
                    datagramSocket.receive(new DatagramPacket(new byte[4096], 4096));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            Thread.sleep(10000000);
    

    After setting a timeout of datagramSocket(so unpark nano won't be Long.MAX_VALUE), all problems goes away.


    I filed a bug report for jdk:

    Calling java.net.DatagramSocket#receive without soTimeout in virtual thread would lead to virtualthread unparker scheduled after about Long.MAX_VALUE, which leading to negative time returned by java.util.concurrent.ScheduledThreadPoolExecutor#triggerTime(long) in java.lang.VirtualThread#scheduleUnpark

    It may become the head of workQueue (depends on whether it overflows again when compareTo). If that happens and there are some tasks after the head, the unparker thread could blocked forever by the head task and never run some tasks after the head.

    Even worse, when first task is canceled (park condition met), the unparker thread is still parking because java.lang.VirtualThread#cancel pass false to maybeInterruptIfRunning.

    --- Edit ---

    The stucking of ScheduledThreadPoolExecutor stucking should only happens under heavy load.

    Assuming there is a existing task in queue scheduled at currentTime1 after delay1 , if the negative time want to be the head of queue. The following equation need to be satisfied: Long.MAX(aka. delay2) + currentTime2 - (currentTime1 + delay1) < 0 according the the implentation of java.util.concurrent.ScheduledThreadPoolExecutor.ScheduledFutureTask#compareTo

    So currentTime2 - (currentTime1 + delay1) > 0 need to be true in order to let Long.MAX overflow. Because currentTime2 equals to currentTime when enquing task2, currentTime1 + delay1 must be less than current time to let above equation to be true which means task1 should have been polled out of queue if not under heavy load.