Using a WDT to detect hung code in an embedded system, specifically STM32

I'm designing bare-metal STM32 firmware that must detect hung or lost code and reset. My approach is to have each of the interrupt-based processes (basically interrupt-driven code) increment its own global variable as it runs, then, in a highest-priority 'supervisor' task, check that each global variable is changing. If any one has stopped changing, the supervisor allows the WDT to reset the board.

Is this a sound approach? Any better ideas?

Solution

Assuming a foreground/background "super-loop" architecture, with interrupt handlers and a single main thread, I would suggest that a better method is to implement a timeout for each interrupt.

For example, assume you have implemented a basic system tick interface (using SYSTICK on Cortex-M), with a function tickms() returning elapsed time in milliseconds. Then for each interrupt being watch-dogged you might have an enumeration such as:

typedef enum
{
    WDG_UART1,
    WDG_UART2,
    WDG_TIMER1,
    ...
    NUMBER_OF_WDG
} eWdg ;

Then an array such as:

volatile struct
{
    unsigned period ;     // timeout in milliseconds
    unsigned timestamp ;  // tickms() value at the last wdgReset()
} wdg[NUMBER_OF_WDG] =
{
    {1000, 0}, // WDG_UART1
    {1000, 0}, // WDG_UART2
    {100,  0}, // WDG_TIMER1
    ...
} ;

and an API:

void wdgReset( eWdg wdg_id )
{
    wdg[wdg_id].timestamp = tickms() ;
}

void wdgCheck()
{
    for( int i = 0; i < NUMBER_OF_WDG; i++ )
    {
        while( tickms() - wdg[i].timestamp > wdg[i].period )
        {
            // spin while timeout expired until 
            // interrupt recovers or hardware watchdog
            // fires
        }
       
        resetHwWatchdog() ;
    }
}
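
The tickms() function assumed above, and the elapsed-time comparison used in wdgCheck(), might be sketched as follows. This is only a minimal illustration: it assumes SysTick has already been configured to interrupt every 1 ms (with CMSIS, typically SysTick_Config(SystemCoreClock / 1000)), and the names msTicks and wdgExpired are hypothetical, not part of the code above.

```c
#include <stdint.h>

/* Hypothetical millisecond counter, incremented by the SysTick interrupt. */
static volatile uint32_t msTicks = 0;

/* CMSIS-standard handler name; assumes SysTick is set up for a 1 ms period. */
void SysTick_Handler(void)
{
    msTicks++;
}

/* Elapsed milliseconds since start-up; wraps after about 49.7 days. */
uint32_t tickms(void)
{
    return msTicks;
}

/* Hypothetical helper: non-zero if more than 'period' ms have passed since
   'timestamp'. The unsigned subtraction gives the correct elapsed time
   even after the counter has wrapped around. */
int wdgExpired(uint32_t timestamp, uint32_t period)
{
    return (uint32_t)(tickms() - timestamp) > period;
}
```

Because the comparison is done on the unsigned difference rather than on absolute times, no special handling is needed when the tick counter overflows.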

Then each interrupt resets its timeout via wdgReset(), and the main loop continuously checks the software watchdogs thus:

int main()
{
    for(;;)
    {
        // do any background processing here
        backgroundTasks() ;

        // Check interrupts
        wdgCheck() ;
    }
}

Then:

if the main thread stalls, the hardware watchdog will fire,
if any interrupt stalls, the hardware watchdog will fire,
if an interrupt stops firing, the wdgCheck() will spin and the hardware watchdog will fire,
if everything is working normally, the software watchdogs will be reset, and the hardware watchdog will be serviced by wdgCheck(),
every task or interrupt has its own specific period suited to its expected rate in the application.

Note that on Cortex-M you can issue a software reset via the NVIC, so you could optionally reset immediately on a software watchdog expiry rather than wait for the hardware watchdog.
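
A variant of wdgCheck() that resets immediately might look something like the sketch below. It is an illustration under assumptions, not a drop-in replacement: wdgCheckOrReset, systemReset, msTicks and resetRequested are hypothetical names, only two watchdogs are shown, and the tick counter is a stand-in for the SysTick-driven one. On a target built with the CMSIS headers, the body of systemReset() would be the call NVIC_SystemReset(); here it only records the request so the sketch also builds off-target.

```c
#include <stdint.h>

/* Stand-in for the SysTick-driven millisecond counter. */
static volatile uint32_t msTicks = 0;
static uint32_t tickms(void) { return msTicks; }

/* Hypothetical reset wrapper. */
static int resetRequested = 0;
static void systemReset(void)
{
    /* On target, replace this body with the CMSIS call NVIC_SystemReset(). */
    resetRequested = 1;
}

/* Two illustrative software watchdogs: period and last-serviced time in ms. */
enum { WDG_UART1, WDG_TIMER1, NUMBER_OF_WDG };
static volatile struct
{
    uint32_t period;
    uint32_t timestamp;
} wdg[NUMBER_OF_WDG] = { {1000, 0}, {100, 0} };

/* Resets at once on expiry instead of spinning until the hardware
   watchdog fires. Returns the index of the expired watchdog, or -1 if
   all are healthy (on target, the reset call would not return). */
int wdgCheckOrReset(void)
{
    for (int i = 0; i < NUMBER_OF_WDG; i++)
    {
        if ((uint32_t)(tickms() - wdg[i].timestamp) > wdg[i].period)
        {
            systemReset();
            return i;
        }
    }
    /* resetHwWatchdog() would be called here, as in wdgCheck(). */
    return -1;
}
```

Returning the index is only a convenience for off-target testing; it could equally be stored somewhere that survives the reset so the cause can be reported after restart.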

Clearly this is just a pseudocode outline, and purely illustrative - it could be refined and extended in several ways. If you were to use an RTOS, you could similarly protect tasks, with the supervisor either in the idle loop or in a task having a lower priority than any other.

One refinement I would suggest would be to have a dynamic registry of software watchdogs rather than the static array, with an API such as:

tWdgHandle wdgCreate( unsigned period ) ;

for example, so that tasks and device drivers can independently add their own watchdogs, and wdgCheck() would iterate over all registered watchdogs. A task could even modify its period dynamically as required (if the task were temporarily disabled, for example):

void wdgSetPeriod( tWdgHandle wdg, unsigned period ) ;
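
One possible shape for such a registry is sketched below, using a fixed-size pool rather than heap allocation. Everything beyond the two prototypes above is an assumption made for illustration: the int handle type, WDG_MAX, WDG_INVALID, wdgPool, wdgAllAlive, the stand-in tick counter, and a wdgReset() that takes a handle instead of an enum value.

```c
#include <stdint.h>

#define WDG_MAX     8      /* hypothetical pool size */
#define WDG_INVALID (-1)   /* returned when the pool is full */

typedef int tWdgHandle;    /* assumed here to be an index into the pool */

/* Stand-in for the SysTick-driven millisecond counter. */
static volatile uint32_t msTicks = 0;
static uint32_t tickms(void) { return msTicks; }

static volatile struct
{
    uint32_t period;       /* timeout in milliseconds */
    uint32_t timestamp;    /* tickms() value when last serviced */
} wdgPool[WDG_MAX];
static int wdgCount = 0;

/* Register a new software watchdog; its timeout starts from now. */
tWdgHandle wdgCreate(unsigned period)
{
    if (wdgCount >= WDG_MAX) { return WDG_INVALID; }
    wdgPool[wdgCount].period = period;
    wdgPool[wdgCount].timestamp = tickms();
    return wdgCount++;
}

/* Service a watchdog (ignores invalid handles). */
void wdgReset(tWdgHandle wdg)
{
    if (wdg >= 0 && wdg < wdgCount) { wdgPool[wdg].timestamp = tickms(); }
}

/* Change the period and restart the timeout from now. */
void wdgSetPeriod(tWdgHandle wdg, unsigned period)
{
    if (wdg >= 0 && wdg < wdgCount)
    {
        wdgPool[wdg].period = period;
        wdgPool[wdg].timestamp = tickms();
    }
}

/* Returns 1 if no registered watchdog has expired, 0 otherwise. */
int wdgAllAlive(void)
{
    for (int i = 0; i < wdgCount; i++)
    {
        if ((uint32_t)(tickms() - wdgPool[i].timestamp) > wdgPool[i].period)
        {
            return 0;
        }
    }
    return 1;
}
```

In this sketch a task that is temporarily disabled could call wdgSetPeriod() with UINT32_MAX, since a 32-bit elapsed time can never exceed that value. The sketch also assumes registration happens during initialisation, before the relevant interrupts are enabled; if watchdogs may be created concurrently, wdgCreate() would need a critical section.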