Tags: c, debugging, arm, stm32, cortex-m

How can I debug a HardFault on my STM32H743 when the recovered stack frame does not contain plausible information?


I am trying to track down the cause of a HardFault that occasionally occurs on my STM32H743. I have narrowed the culprit down to a section of about 200 lines of code, and now I would like to pin down the exact location.

Some information that may or may not be relevant: I am programming the chip in C with the ST HAL, and I debug with GDB through an ST-Link V3 from VS Code.

Debugging is complicated by a couple of factors:

  • The error occurs very rarely. I have found a reliable way to reproduce the fault, but it can take anywhere from several minutes to an hour until it occurs.
  • The error is dependent on some hardware timing (it involves the UART peripheral). I can't just step through the code until it occurs, because when stepping through, it just won't occur.
  • The error does not seem to leave a meaningful stack frame (see below). This is actually the main topic of my question.

To find the cause of the error, I followed the instructions found here: https://interrupt.memfault.com/blog/cortex-m-hardfault-debug

The value of my CFSR (0xE000ED28) is 0x1, so an IACCVIOL. Hmm... At this point I guess it makes sense to mention that I don't have the MPU enabled. If I understand correctly, this means that something attempted to jump to a memory location from which code execution is not allowed. What are possible reasons for this?

The value of my HFSR (0xE000ED2C) is 0x40000000. So a FORCED fault, does this help me in any way?
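As a rough sketch of how these two register values can be decoded: the bit positions below follow the ARMv7-M definitions of CFSR and HFSR, while the helper names (`cfsr_fault_group`, `cfsr_is_iaccviol`, `hfsr_is_forced`) are made up for illustration. A set FORCED bit generally indicates that a configurable fault (here the MemManage fault reported in CFSR) was escalated to a HardFault, for example because its own handler is not enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined for SCB->CFSR and SCB->HFSR on ARMv7-M cores. */
#define CFSR_IACCVIOL        (1u << 0)    /* instruction fetch from a non-executable address */
#define CFSR_MEMMANAGE_MASK  0x000000FFu  /* MMFSR, bits 7:0   */
#define CFSR_BUSFAULT_MASK   0x0000FF00u  /* BFSR,  bits 15:8  */
#define CFSR_USAGEFAULT_MASK 0xFFFF0000u  /* UFSR,  bits 31:16 */
#define HFSR_FORCED          (1u << 30)   /* configurable fault escalated to HardFault */

/* Hypothetical helper type: which CFSR sub-register reported the fault. */
typedef enum { FAULT_NONE, FAULT_MEMMANAGE, FAULT_BUS, FAULT_USAGE } fault_group_t;

/* Returns the first fault group with any bit set (several can be set at once). */
fault_group_t cfsr_fault_group(uint32_t cfsr)
{
    if (cfsr & CFSR_MEMMANAGE_MASK)  return FAULT_MEMMANAGE;
    if (cfsr & CFSR_BUSFAULT_MASK)   return FAULT_BUS;
    if (cfsr & CFSR_USAGEFAULT_MASK) return FAULT_USAGE;
    return FAULT_NONE;
}

bool cfsr_is_iaccviol(uint32_t cfsr)
{
    return (cfsr & CFSR_IACCVIOL) != 0u;
}

bool hfsr_is_forced(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0u;
}
```

With the values from this post, `cfsr_fault_group(0x1)` gives `FAULT_MEMMANAGE` and `hfsr_is_forced(0x40000000)` is true, so the HardFault would be an escalated MemManage fault rather than, say, a vector table read error.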

The value of my lr register is 0xFFFFFFE9. lr & (1<<2) is 0x0, so this means msp should be the active stack pointer. Reading out a stack frame from msp (running p/a *(uint32_t[8] *)$msp) gives me:

0x672,        // r0
0x40,         // r1
0x631090,     // r2
0x20001800,   // r3
0xff,         // r12
0xfc006e3f,   // lr
0xfc006e3e,   // pc
0x81000000    // xPSR
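For reference, the EXC_RETURN check and the frame layout used above can be sketched in C as follows. The struct and function names are invented for this example; the bit meanings are the ARMv7-M ones (bit 2 selects MSP/PSP, bit 3 selects Thread/Handler mode, and on cores with an FPU bit 4 cleared means an extended frame with floating-point state follows the eight basic words).

```c
#include <stdbool.h>
#include <stdint.h>

/* Basic exception frame pushed on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame_t;

/* EXC_RETURN bit 2: 0 = frame was pushed on MSP, 1 = on PSP. */
bool exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0u;
}

/* EXC_RETURN bit 3: 1 = fault was taken from Thread mode, 0 = from Handler mode. */
bool exc_return_from_thread_mode(uint32_t exc_return)
{
    return (exc_return & (1u << 3)) != 0u;
}

/* EXC_RETURN bit 4 (FPU cores): 0 = FP registers are stacked after the basic frame. */
bool exc_return_has_fp_frame(uint32_t exc_return)
{
    return (exc_return & (1u << 4)) == 0u;
}

/* Hypothetical helper: view eight words read from the stack pointer as a frame. */
exception_frame_t frame_from_words(const uint32_t w[8])
{
    exception_frame_t f = { w[0], w[1], w[2], w[3], w[4], w[5], w[6], w[7] };
    return f;
}
```

For lr = 0xFFFFFFE9 this reports MSP, Thread mode, and an extended (floating-point) frame. The eight basic words still sit at the stack pointer either way, so the listing above is read from the right place.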

So if I understand correctly, LR should be the return address of the last jump before the HardFault handler. But what kind of return address is 0xfc006e3f? I guess this is part of the reason why the HardFault occurs in the first place, but how can I find out from this information where the problem was actually caused?
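One way to sanity-check a stacked pc like this, offered as a sketch rather than a diagnosis: in the ARMv7-M default memory map, the peripheral, external device, and system regions are execute-never even when no MPU is enabled, and an instruction fetch from them raises IACCVIOL. The function name below is made up; the region boundaries are the architectural defaults, not anything specific to this project.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `addr` lies in a region that the ARMv7-M default memory map
 * marks execute-never (XN), independent of any MPU configuration:
 *   0x40000000-0x5FFFFFFF  Peripheral
 *   0xA0000000-0xDFFFFFFF  External device
 *   0xE0000000-0xFFFFFFFF  System (PPB and vendor-specific)
 * Code, SRAM, and external RAM regions are executable by default. */
bool default_map_is_xn(uint32_t addr)
{
    if (addr >= 0x40000000u && addr <= 0x5FFFFFFFu) {
        return true;
    }
    return addr >= 0xA0000000u;
}
```

Here `default_map_is_xn(0xfc006e3e)` is true, which fits the IACCVIOL with the MPU disabled: the core tried to fetch an instruction from the system region. Since no valid code lives there, the more useful question is probably what overwrote the return address or function pointer that led to that value.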


Solution

  • I was able to find my issue, thanks for all the suggestions in the comments!

    In the end @Ilya's comment was almost spot-on: my code uses a library that implements the OneWire protocol over the UART peripheral. This library declares a UART receive buffer on the stack and uses it to receive data in interrupt mode. If the OneWire transaction times out for some reason, the library returns an error code, but it does NOT cancel the UART receive operation. Apparently the UART receive sometimes completes a little later and overwrites whatever happens to be on the stack at the location where the receive buffer used to be. In my unfortunate case that left me with a HardFault and no clue as to what actually caused it.

    I solved the issue by canceling the UART Receive operation in case of a OneWire timeout.
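To illustrate the pattern, here is a minimal host-side sketch, not the library's actual code. Everything prefixed `mock_` and the `onewire_read_byte` function are hypothetical stand-ins: `mock_uart_receive_it` plays the role of an interrupt-mode receive such as `HAL_UART_Receive_IT`, and `mock_uart_abort_receive` the role of a cancel call such as `HAL_UART_AbortReceive` (assumed here to be the relevant HAL functions). The point is only that the pending receive must be cancelled before the stack buffer goes out of scope.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a UART driver's interrupt-mode receive state. */
typedef struct {
    uint8_t *rx_buf;  /* destination of the pending receive, NULL when idle */
    size_t   rx_len;  /* bytes still expected */
} mock_uart_t;

/* Stand-in for starting a non-blocking receive: the driver keeps the pointer. */
void mock_uart_receive_it(mock_uart_t *u, uint8_t *buf, size_t len)
{
    u->rx_buf = buf;
    u->rx_len = len;
}

/* Stand-in for cancelling the pending receive: the driver forgets the pointer. */
void mock_uart_abort_receive(mock_uart_t *u)
{
    u->rx_buf = NULL;
    u->rx_len = 0;
}

/* Stand-in for the RX interrupt: stores the byte only if a receive is pending.
 * Returns true if memory was written. */
bool mock_uart_rx_isr(mock_uart_t *u, uint8_t byte)
{
    if (u->rx_buf == NULL || u->rx_len == 0) {
        return false;
    }
    *u->rx_buf++ = byte;
    if (--u->rx_len == 0) {
        u->rx_buf = NULL;  /* transfer complete */
    }
    return true;
}

typedef enum { OW_OK, OW_TIMEOUT } ow_status_t;

/* Hypothetical transaction; `response_arrived` simulates whether the byte
 * shows up before the timeout expires. */
ow_status_t onewire_read_byte(mock_uart_t *u, bool response_arrived, uint8_t *out)
{
    uint8_t rx;  /* receive buffer on this function's stack */

    mock_uart_receive_it(u, &rx, 1);
    if (response_arrived) {
        mock_uart_rx_isr(u, 0xA5);
    }
    if (u->rx_buf != NULL) {
        /* Timed out with the receive still pending: cancel it, otherwise a
         * late byte would be written into a stack slot that is no longer ours. */
        mock_uart_abort_receive(u);
        return OW_TIMEOUT;
    }
    *out = rx;
    return OW_OK;
}
```

After a timeout, a late call to `mock_uart_rx_isr` returns false and touches no memory. Without the abort, it would write through the stale pointer into whatever stack frame happens to occupy that address, which is the kind of corruption described above.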