Tags: c, linux, network-programming, linux-kernel, interrupt

Workqueue starts, is interrupted, and never finishes, causing a CPU stall


Providing a minimal code example will be difficult, but I'll provide some pseudocode to hopefully get the point/question across.

TL;DR: My workqueue starts, is interrupted, and then never finishes, causing a CPU stall.

I am creating a network driver for a PCIe device. For terminology: Tx = host out, Rx = host in. For the Tx side of things I'm using a workqueue (work_struct). So, roughly:

ndo_start_xmit() {
        //Perform some operations and load a DMA.
}

request_irq(irq_handler);
INIT_WORK(&work, work_handler);

irq_handler() {
        //Check what caused the IRQ.
        if (ndo_xmit_dma caused the irq)
                schedule_work(&work);
}

work_handler() {
        if (xmit_called) {
                spin_lock(&lock);
                //Do some stuff
                spin_unlock(&lock);
        }
}

Then for the Rx side of things it's similar, but it uses NAPI instead of a workqueue, because I'm learning and honestly could likely move all of the work to NAPI (please say if that would solve the problem); a rough sketch of what the poll side can look like follows the snippet below.

irq_handler() {
        if (Rx caused the irq)
                napi_schedule(&napi);
}

//Do a bunch of NAPI-related stuff (never try to grab the spin_lock).
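
To make that "bunch of NAPI-related stuff" concrete, here is a minimal sketch of what a poll callback typically looks like; struct my_priv, my_rx_clean() and my_enable_rx_irq() are placeholder names I'm assuming, not identifiers from the actual driver.

struct my_priv {
        struct napi_struct napi;
        //... DMA rings, locks, etc. ...
};

static int my_napi_poll(struct napi_struct *napi, int budget)
{
        struct my_priv *priv = container_of(napi, struct my_priv, napi);
        int work_done;

        //Process up to `budget` received buffers; never touch the Tx spin_lock here.
        work_done = my_rx_clean(priv, budget);

        if (work_done < budget) {
                //All Rx work done for now: leave polling mode and re-enable the Rx interrupt.
                napi_complete_done(napi, work_done);
                my_enable_rx_irq(priv);
        }

        return work_done;
}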

So what's the problem? Well, midway through my work_handler for Tx, an Rx IRQ happens (no big deal so far). The IRQ obviously bounces me out of the workqueue, at which point NAPI is scheduled. Instead of going back to the workqueue, the CPU handles the NAPI function (again, not a huge deal for my program; I assume this is a priority thing). Then the kernel calls ndo_start_xmit again, which gets to the spin_lock, at which point the CPU stalls. At no point does execution ever go back to the scheduled-but-interrupted work in work_handler. In testing it was actually interrupted right between two print statements, so I know it never even did a partial return.

So why does the workqueue never return? Is there a way to solve this? My initial guess is flush_work, but that feels more like a patch over the problem rather than a fix for the root of it. Would it be better to move my Tx schedule_work into the NAPI handler instead?

Thank you for the insight.

UPDATE: This is after I accepted a perfectly good answer. In the discussion that followed I proposed multiple NAPI instances. Simply put: one NAPI per netdev, otherwise lots of problems show up. I wasn't able to distinguish what caused the NAPI poll with just the napi struct (maybe someone sees a way I don't, short of abusing the budget number). As for my problem, it turned out to be a three-step issue: the workqueue was interrupted by the Rx IRQ/NAPI; the Rx NAPI then got blocked by a call to ndo_start_xmit; and ndo_start_xmit tries to grab the same spinlock the workqueue was holding, so I got stuck in a position where nothing could move, hence the CPU stall.
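
To make the chain easier to follow, here is a condensed sketch of that three-step failure as code; struct my_priv, tx_lock, tx_work and my_start_xmit are placeholder names I'm assuming, not the driver's real identifiers. The numbered comments mark the order of events on a single CPU.

#include <linux/netdevice.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct my_priv {
        spinlock_t tx_lock;
        struct work_struct tx_work;
};

static void work_handler(struct work_struct *work)
{
        struct my_priv *priv = container_of(work, struct my_priv, tx_work);

        spin_lock(&priv->tx_lock);      //step 1: lock taken in process context
        //step 2: the Rx IRQ fires here, napi_schedule() raises the NET_RX
        //softirq, and it runs on this CPU before work_handler() can resume
        spin_unlock(&priv->tx_lock);    //never reached again
}

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct my_priv *priv = netdev_priv(dev);

        spin_lock(&priv->tx_lock);      //step 3: reached from the Rx path on the
                                        //same CPU, spins forever -> CPU stall
        spin_unlock(&priv->tx_lock);
        return NETDEV_TX_OK;
}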


Solution

  • If the area between spin_lock..spin_unlock is pretty short, spin_lock_irqsave may be applicable. At the very least, try it to see if it makes your problem go away. My suspicion is NAPI has pinned your work_handler context.

    While _irqsave may happen to work, you should do a proper lock ordering analysis; a sketch of the _irqsave variant applied to your two lock sites follows the excerpt below.

    Have a look at https://www.kernel.org/doc/Documentation/locking/spinlocks.txt; particularly the bottom bit with:

    spin_lock(&lock);
    ...
        <- interrupt comes in:
            spin_lock(&lock);
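
    To make the _irqsave suggestion concrete, here is a minimal sketch of that variant applied to the two lock sites from the question; struct my_priv, tx_lock and the helper names are assumptions on my part, not the asker's actual code. Taking the lock with local interrupts disabled keeps the Rx IRQ (and the NAPI poll it schedules) from preempting this CPU inside the critical section, which is exactly the window in which the stall was triggered.

    #include <linux/netdevice.h>
    #include <linux/spinlock.h>
    #include <linux/workqueue.h>

    struct my_priv {
            spinlock_t tx_lock;
            struct work_struct tx_work;
    };

    static void work_handler(struct work_struct *work)
    {
            struct my_priv *priv = container_of(work, struct my_priv, tx_work);
            unsigned long flags;

            spin_lock_irqsave(&priv->tx_lock, flags);
            //Do the Tx completion work with interrupts masked on this CPU.
            spin_unlock_irqrestore(&priv->tx_lock, flags);
    }

    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct my_priv *priv = netdev_priv(dev);
            unsigned long flags;

            spin_lock_irqsave(&priv->tx_lock, flags);
            //Set up and kick the DMA for this skb.
            spin_unlock_irqrestore(&priv->tx_lock, flags);

            return NETDEV_TX_OK;
    }

    The important part is that the process-context holder (work_handler) keeps interrupts off while the lock is held; otherwise the Rx path can preempt it on the same CPU and spin on a lock that can never be released, which is the scenario shown in the excerpt above.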