Using a WDT to detect hung code in an embedded system, specifically STM32

I'm designing bare-metal STM32 firmware that must detect hung or lost code and reset. My approach is to have each of the interrupt-based processes (basically interrupt-driven code) increment its own global variable as it runs, then, in a highest-priority 'supervisor' task, check that each global variable is still changing. If any one has stopped changing, the supervisor allows the WDT to reset the board.

Does this sound like a sound approach? Any better ideas?


  • Assuming a foreground/background "super-loop" architecture, with interrupt handlers and a single main thread, I would suggest that a better method is to implement a timeout for each interrupt.

    For example, assume you have implemented a basic system tick interface (using SYSTICK on Cortex-M), with a function tickms() returning elapsed time in milliseconds. Then for each interrupt being watchdogged you might have an enumeration such as:

    typedef enum
    {
        WDG_UART1,
        WDG_UART2,
        WDG_TIMER1,
        NUMBER_OF_WDG   // number of entries; must remain last
    } eWdg ;

    Then an array such as:

    volatile struct
    {
        unsigned period ;     // timeout in milliseconds
        unsigned timestamp ;  // tickms() value at the last wdgReset()
    } wdg[NUMBER_OF_WDG] =
    {
        {1000, 0}, // WDG_UART1
        {1000, 0}, // WDG_UART2
        {100,  0}  // WDG_TIMER1
    } ;

    and an API:

    void wdgReset( eWdg wdg_id )
    {
        wdg[wdg_id].timestamp = tickms() ;
    }

    void wdgCheck( void )
    {
        for( int i = 0; i < NUMBER_OF_WDG; i++ )
        {
            while( tickms() - wdg[i].timestamp > wdg[i].period )
            {
                // spin while timeout expired until
                // interrupt recovers or hardware watchdog
                // fires
            }
            resetHwWatchdog() ;
        }
    }
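
    The comparison in wdgCheck() relies on unsigned arithmetic: subtracting the stored timestamp from the current tick gives the elapsed time even after the tick counter has wrapped past zero. A minimal sketch of that property, assuming a 32-bit millisecond counter; wdgElapsed() and wdgExpired() are hypothetical helper names, not part of the API above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Elapsed milliseconds between a stored timestamp and the current tick.
   Unsigned subtraction is modulo 2^32, so the result stays correct across
   a single wrap of the counter. */
uint32_t wdgElapsed( uint32_t now, uint32_t timestamp )
{
    return (uint32_t)( now - timestamp ) ;
}

/* True once more than 'period' milliseconds have passed since 'timestamp'. */
bool wdgExpired( uint32_t now, uint32_t timestamp, uint32_t period )
{
    return wdgElapsed( now, timestamp ) > period ;
}
```

    Comparing absolute values instead (now > timestamp + period) would misbehave at the wrap; the subtract-then-compare form does not, provided each period is comfortably below the counter's range.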

    Then each interrupt resets its timeout via wdgReset(), and the main loop continuously checks the software watchdogs thus:

    int main( void )
    {
        for(;;)
        {
            // do any background processing here
            backgroundTasks() ;
            // Check interrupts
            wdgCheck() ;
        }
    }


    With this arrangement:

    • if the main thread stalls, the hardware watchdog will fire,
    • if any interrupt stalls, the hardware watchdog will fire,
    • if an interrupt stops firing, wdgCheck() will spin and the hardware watchdog will fire,
    • if everything is working normally, the software watchdogs will be reset, and the hardware watchdog will be serviced by wdgCheck(),
    • every task or interrupt has its own specific period suited to its expected rate in the application.
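
    The outline above assumes tickms() already exists. A minimal sketch of one way to provide it, assuming the SysTick interrupt has been configured for a 1 kHz rate (on target, typically with the CMSIS call SysTick_Config( SystemCoreClock / 1000 ) during start-up); the counter name tick_count is illustrative:

```c
#include <stdint.h>

/* Millisecond counter, written only by the SysTick interrupt.
   An aligned 32-bit read is atomic on Cortex-M, so readers need no lock. */
static volatile uint32_t tick_count = 0 ;

/* CMSIS-standard handler name; runs once per millisecond when SysTick
   is configured for a 1 kHz interrupt rate. */
void SysTick_Handler( void )
{
    tick_count++ ;
}

/* Elapsed time in milliseconds since start-up (wraps after about 49.7 days). */
unsigned tickms( void )
{
    return tick_count ;
}
```

    Because the SysTick interrupt is itself the time base, its priority should be chosen so that a stuck lower-priority handler cannot stop the tick from advancing.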

    Note that on Cortex-M you can issue a software reset via the NVIC, so you could optionally reset immediately on a software watchdog expiry rather than wait for the hardware watchdog.
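
    A minimal sketch of that option: a non-blocking check that reports the first expired software watchdog so the caller can reset at once. wdgFindExpired() and tWdgEntry are hypothetical names; NVIC_SystemReset() is the CMSIS call for a software reset and appears only in a comment so that the sketch also builds off-target:

```c
#include <stdint.h>

typedef struct
{
    uint32_t period ;     /* timeout in milliseconds */
    uint32_t timestamp ;  /* tick value at the last service */
} tWdgEntry ;

/* Return the index of the first entry whose timeout has expired at time
   'now', or -1 if every software watchdog is still within its period. */
int wdgFindExpired( const volatile tWdgEntry* table, int count, uint32_t now )
{
    for( int i = 0; i < count; i++ )
    {
        if( (uint32_t)( now - table[i].timestamp ) > table[i].period )
        {
            return i ;
        }
    }
    return -1 ;
}

/* Intended use on target, in place of the spin loop:

       if( wdgFindExpired( table, count, tickms() ) >= 0 )
       {
           NVIC_SystemReset() ;   // CMSIS; does not return
       }
       resetHwWatchdog() ;
*/
```

    Returning the index rather than a simple flag leaves room to record which source failed before the reset is issued.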

    Clearly this is just a pseudocode outline, and purely illustrative - it could be refined and extended in several ways. If you were to use an RTOS, you could similarly protect tasks, with the supervisor either in the idle loop or in a task having a lower priority than any other.

    One refinement I would suggest would be to have a dynamic registry of software watchdogs rather than the static array and have an API:

    tWdgHandle wdgCreate( unsigned period ) ;

    for example, so that tasks and device drivers can independently add their own watchdogs, and wdgCheck() would iterate over all registered watchdogs. A task could even modify the period dynamically as required (if it were temporarily disabled, for example):

    void wdgSetPeriod( tWdgHandle wdg, unsigned period ) ;
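
    A minimal sketch of such a registry, assuming a fixed-capacity table so that no heap is needed. Everything beyond the two prototypes above is hypothetical: the capacity WDG_MAX, tWdgHandle as a plain index, WDG_INVALID, wdgService(), wdgAnyExpired(), and the tickms() stand-in used so the sketch can be built off-target:

```c
#include <assert.h>
#include <stdbool.h>

#define WDG_MAX     8       /* hypothetical capacity */
#define WDG_INVALID (-1)    /* returned when the table is full */

typedef int tWdgHandle ;

static volatile struct
{
    unsigned period ;     /* timeout in milliseconds */
    unsigned timestamp ;  /* tickms() value at the last service */
} registry[WDG_MAX] ;

static int wdg_count = 0 ;

/* Stand-in tick source for off-target builds; on target tickms() would
   read the SysTick-driven millisecond counter instead. */
static unsigned fake_ms = 0 ;
unsigned tickms( void ) { return fake_ms ; }

/* Register a new software watchdog; its timeout starts from now.
   Assumed to be called during initialisation, before the interrupts
   that service the watchdogs are enabled. */
tWdgHandle wdgCreate( unsigned period )
{
    if( wdg_count >= WDG_MAX )
    {
        return WDG_INVALID ;
    }
    registry[wdg_count].period = period ;
    registry[wdg_count].timestamp = tickms() ;
    return wdg_count++ ;
}

/* Change the timeout and restart it, e.g. around a temporary disable. */
void wdgSetPeriod( tWdgHandle wdg, unsigned period )
{
    assert( wdg >= 0 && wdg < wdg_count ) ;
    registry[wdg].period = period ;
    registry[wdg].timestamp = tickms() ;
}

/* Equivalent of wdgReset() for a registered watchdog. */
void wdgService( tWdgHandle wdg )
{
    assert( wdg >= 0 && wdg < wdg_count ) ;
    registry[wdg].timestamp = tickms() ;
}

/* True if any registered watchdog has exceeded its period. */
bool wdgAnyExpired( void )
{
    for( int i = 0; i < wdg_count; i++ )
    {
        if( tickms() - registry[i].timestamp > registry[i].period )
        {
            return true ;
        }
    }
    return false ;
}
```

    In this sketch wdgCheck() would service the hardware watchdog only while wdgAnyExpired() returns false. Restarting the timeout inside wdgSetPeriod() is a design choice made here so that a newly lengthened or shortened period always counts from the moment of the change.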