I'm designing bare-metal STM32 firmware that must detect hung or lost code and reset. My approach is to have each of the interrupt-based processes (basically interrupt-driven code) increment its own global variable as it runs, then in a highest-priority 'supervisor' task, check that each global variable is changing. If any one has stopped changing, then allow the WDT to reset the board.
Does this sound like a sound approach? Any better ideas?
Assuming a foreground/background "super-loop" architecture, with interrupt handlers and a single main thread, I would suggest that a better method is to implement a timeout for each interrupt.
For example, assume you have implemented a basic system tick interface (using SYSTICK on Cortex-M), with a function tickms() returning elapsed time in milliseconds. Then for each interrupt being watchdogged you might have an enumeration such as:
    typedef enum
    {
        WDG_UART1,
        WDG_UART2,
        WDG_TIMER1,
        ...
        NUMBER_OF_WDG
    } eWdg ;
Then an array such as:
    volatile struct
    {
        unsigned period ;     // timeout in milliseconds
        unsigned timestamp ;  // tickms() value at the last wdgReset()
    } wdg[NUMBER_OF_WDG] =
    {
        {1000, 0}, // WDG_UART1
        {1000, 0}, // WDG_UART2
        {100, 0},  // WDG_TIMER1
        ...
    } ;
and an API:
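    // Called from each watched interrupt handler to restart its timeout
    void wdgReset( eWdg wdg_id )
    {
        wdg[wdg_id].timestamp = tickms() ;
    }

    // Called from the main loop to check every software watchdog
    void wdgCheck( void )
    {
        for( int i = 0; i < NUMBER_OF_WDG; i++ )
        {
            // Unsigned subtraction keeps this correct across tick wrap-around
            while( tickms() - wdg[i].timestamp > wdg[i].period )
            {
                // spin while timeout expired until
                // interrupt recovers or hardware watchdog
                // fires
            }

            // This software watchdog is healthy, so kick the hardware one
            resetHwWatchdog() ;
        }
    }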
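The comparison in wdgCheck() relies on unsigned subtraction, which still gives the right elapsed time when the millisecond counter wraps. A minimal sketch of that arithmetic, assuming a 32-bit tick counter; the helper names elapsedMs and wdgExpired are hypothetical and not part of the API above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: milliseconds elapsed between 'since' and 'now'.
   Unsigned subtraction is modulo 2^32, so the result stays correct
   across a single wrap of the tick counter. */
uint32_t elapsedMs( uint32_t now, uint32_t since )
{
    return now - since ;
}

/* True when more than 'period' ms have passed since 'timestamp'. */
bool wdgExpired( uint32_t now, uint32_t timestamp, uint32_t period )
{
    return elapsedMs( now, timestamp ) > period ;
}
```

With a timestamp taken 5 ms before the counter wraps and a current tick of 5, the elapsed time comes out as 10 ms. A comparison written as now > timestamp + period would misbehave near the wrap, which is why the subtraction form is used.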
Then each interrupt resets its timeout via wdgReset(), and the main loop continuously checks the software watchdogs thus:
    int main( void )
    {
        for(;;)
        {
            // do any background processing here
            backgroundTasks() ;

            // Check interrupts
            wdgCheck() ;
        }
    }
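The outline relies on two pieces it does not show: the tickms() tick source and an interrupt handler that calls wdgReset(). A minimal sketch of both follows. It assumes SysTick has been set up for a 1 ms period (for example with the CMSIS call SysTick_Config( SystemCoreClock / 1000U ) during start-up) and that the handlers use the usual STM32 startup-file names. The variable names msTicks and wdgTimestamp are illustrative only, and the timestamp array is a simplified stand-in for the wdg[] table above.

```c
#include <stdint.h>

typedef enum { WDG_UART1, NUMBER_OF_WDG } eWdg ;

static volatile uint32_t msTicks ;                     /* counts 1 ms ticks */
static volatile uint32_t wdgTimestamp[NUMBER_OF_WDG] ; /* last reset time   */

/* Runs once per millisecond, assuming SysTick was configured that way. */
void SysTick_Handler( void )
{
    msTicks++ ;
}

/* Aligned 32-bit reads are atomic on Cortex-M, so no locking is needed. */
uint32_t tickms( void )
{
    return msTicks ;
}

void wdgReset( eWdg wdg_id )
{
    wdgTimestamp[wdg_id] = tickms() ;
}

/* Handler name assumes the usual STM32 startup-file vector naming. */
void USART1_IRQHandler( void )
{
    /* ... service the UART here ... */

    wdgReset( WDG_UART1 ) ;  /* prove this handler is still running */
}
```

Placing the wdgReset() call at the end of the handler means a handler that gets stuck part-way through never refreshes its timestamp.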
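Then:
if a watched interrupt stops calling wdgReset() within its period, wdgCheck() will spin and the hardware watchdog will fire;
if the main loop itself stalls, wdgCheck() is never called, so the hardware watchdog is not reset and fires as well.
Note that on Cortex-M you can issue a software reset via the NVIC, so you could optionally reset immediately on a software watchdog expiry rather than wait for the hardware watchdog.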
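As a rough sketch of the immediate-reset variant: instead of spinning, the check can request a reset as soon as any timeout expires. The function names wdgCheckImmediate and systemReset are hypothetical. On target, systemReset() would simply call the CMSIS function NVIC_SystemReset(); here it, tickms() and resetHwWatchdog() are host stubs so that the sketch builds anywhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host stubs (assumptions): replace with the real tick source,
   CMSIS NVIC_SystemReset() and the hardware watchdog refresh. */
uint32_t fakeTicks ;
bool resetRequested ;
uint32_t tickms( void )          { return fakeTicks ; }
void     systemReset( void )     { resetRequested = true ; }
void     resetHwWatchdog( void ) { }

typedef struct
{
    uint32_t period ;               /* timeout in milliseconds       */
    volatile uint32_t timestamp ;   /* tick value at the last reset  */
} tWdg ;

tWdg wdg[] = { {1000, 0}, {100, 0} } ;
#define NUMBER_OF_WDG ( sizeof( wdg ) / sizeof( wdg[0] ) )

void wdgCheckImmediate( void )
{
    for( size_t i = 0; i < NUMBER_OF_WDG; i++ )
    {
        if( (uint32_t)( tickms() - wdg[i].timestamp ) > wdg[i].period )
        {
            systemReset() ;   /* does not return on real hardware */
            return ;
        }
    }

    /* Every software watchdog is healthy: kick the hardware watchdog. */
    resetHwWatchdog() ;
}
```

The hardware watchdog remains as a backstop for the case where the main loop never reaches the check at all.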
Clearly this is just a pseudocode outline, and purely illustrative - it could be refined and extended in several ways. If you were to use an RTOS, you could similarly protect tasks, with the supervisor either in the idle loop or in a task having a lower priority than any other.
One refinement I would suggest would be to have a dynamic registry of software watchdogs rather than the static array and have an API:
tWdgHandle wdgCreate( unsigned period ) ;
for example so that tasks and device drivers can independently add their own watchdogs, and wdgCheck() would iterate over all registered watchdogs. A task could even modify the period dynamically as required (if it were temporarily disabled for example):
void wdgSetPeriod( tWdgHandle wdg, unsigned period ) ;
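One possible shape for such a registry is sketched below, using a fixed-capacity table so that no heap is required. It makes several assumptions: tWdgHandle is a plain index, WDG_MAX and WDG_INVALID are invented names, wdgAnyExpired() is a hypothetical helper standing in for the loop inside wdgCheck(), and tickms() is stubbed so that the sketch builds on a host.

```c
#include <stdint.h>

#define WDG_MAX      8      /* assumed capacity of the registry    */
#define WDG_INVALID  (-1)   /* hypothetical "registry full" result */

typedef int tWdgHandle ;

static struct
{
    uint32_t period ;
    volatile uint32_t timestamp ;
} registry[WDG_MAX] ;
static int wdgCount ;

/* Stand-in tick source so the sketch builds on a host. */
uint32_t fakeTicks ;
uint32_t tickms( void ) { return fakeTicks ; }

/* Intended to be called from initialisation code, before the watched
   interrupt is enabled; it is not interrupt-safe as written. */
tWdgHandle wdgCreate( unsigned period )
{
    if( wdgCount >= WDG_MAX ) return WDG_INVALID ;
    registry[wdgCount].period = period ;
    registry[wdgCount].timestamp = tickms() ;
    return wdgCount++ ;
}

void wdgReset( tWdgHandle wdg )
{
    registry[wdg].timestamp = tickms() ;
}

/* Changing the period also restarts the timeout. */
void wdgSetPeriod( tWdgHandle wdg, unsigned period )
{
    registry[wdg].period = period ;
    registry[wdg].timestamp = tickms() ;
}

/* Returns 1 if any registered watchdog has expired, otherwise 0. */
int wdgAnyExpired( void )
{
    for( int i = 0; i < wdgCount; i++ )
    {
        if( (uint32_t)( tickms() - registry[i].timestamp ) > registry[i].period )
        {
            return 1 ;
        }
    }
    return 0 ;
}
```

A driver would call wdgCreate() once during initialisation, keep the returned handle, and pass it to wdgReset() from its interrupt handler.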