Tags: arm, counter, interrupt, stm32, low-latency

Compensating latency on ARM interrupts?


I'm working on a project on an STM32F4 CPU, generating signals.

I have a general-purpose timer running at the CPU clock (no prescaler) that triggers an interrupt on overflow; the interrupt handler then toggles a GPIO to generate a periodic signal.

I need to trigger the GPIO at a very precise time (basically down to one CPU cycle of precision). I've managed to reduce the jitter to +/- 5 cycles by setting interrupt priorities and the like, but it is still there and depends on what the CPU was doing.

I need to compensate for these few cycles of jitter. Adding a few more cycles of latency isn't a problem, as long as I toggle the GPIOs at a precise time.

My idea was to read the current value of the counter and busy-wait for FIXED_NUMBER - CURRENT_VALUE cycles, ensuring I would exit the loop at a precise time.

However, a simple loop in C, whether a for loop or a while(counter->value < TARGET), doesn't work: it ADDS jitter instead of reducing it.

Am I doing something wrong or naive? Should I do it in assembly? How would that be different from C? (I checked the GCC disassembly to make sure the loop was not optimized away and was not hitting memory.)

(I verified this with an empty, non-optimized loop body that does not hit memory.)

Edit: see this example on AVR (much more stable, I know): http://lucidscience.com/pro-vga%20video%20generator-7.aspx (search for "jitter").

Edit 2: I tried a simple loop in assembly such as the following (r0 is my counter, the number of cycles to wait, held in a register):

loop : SUBS r0,#1 ; tried with 2 also
       BGE loop

and, again, jitter is better without it.

To sum up, I already know how much I should delay. I just need a way to make a piece of code reliably consume N cycles in one case and M in another. Unfortunately, branches alone don't seem to work, because a pipeline refill doesn't seem to take a reliable number of cycles, and conditional expressions don't either, because they always take the same number of cycles (sometimes doing nothing).

Would running from RAM instead of Flash improve consistency? (NB: the STM32F4 has a Flash prefetch.)


Solution

  • (It is ironic that a question about reducing response latency took three years to get an answer.)

    +/- 5 cycles sounds awfully familiar. You are likely hitting wait states accessing the Flash controller during interrupt dispatch.

    The CPU needs to do three things during interrupt dispatch:

    1. Load the vector table entry.
    2. Load the initial code of your interrupt routine.
    3. Write some of the registers out to the stack.

    If your vector table and/or interrupt routine code are in Flash, the fetches in items 1 and 2 go to Flash. When running the CPU at its highest rated speeds (up to 168 MHz), accesses to Flash entail five wait states. This means that an access to Flash can take either 1 or 6 cycles, depending on whether the data being requested is in the Flash cache. If you're seeing exactly 0 or 5 cycles of extra latency, this is a likely culprit. This problem is most easily fixed by moving the ISR code and the vector table into RAM. You can also "fix" it by disabling the Flash cache, which will cause Flash accesses to be predictably slow.
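
    Moving the vector table into RAM could look roughly like the sketch below. It is a minimal illustration, not code from m4vgalib. The entry count (98, i.e. 16 core exceptions plus 82 device interrupts, as on an STM32F405/407), the 512-byte alignment derived from it, and the names `ram_vectors` and `relocate_vectors` are all assumptions made for this example; check the reference manual for your exact part.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Assumption: 16 core exception slots + 82 device IRQs (STM32F405/407). */
    #define VECTOR_COUNT 98u

    /* VTOR needs the table aligned to its size rounded up to a power of two:
       98 entries * 4 bytes = 392 bytes, so 512-byte alignment. */
    #define VECTOR_ALIGN 512u

    /* RAM copy of the vector table (hypothetical name). */
    static uint32_t ram_vectors[VECTOR_COUNT]
        __attribute__((aligned(VECTOR_ALIGN)));

    /* Copy the table found at flash_table into RAM and return the address
       that should then be written to the VTOR register. */
    static uintptr_t relocate_vectors(const uint32_t *flash_table)
    {
        memcpy(ram_vectors, flash_table, sizeof ram_vectors);

        uintptr_t addr = (uintptr_t)ram_vectors;
        /* Sanity check: a misaligned table would be silently mis-decoded. */
        assert((addr & (VECTOR_ALIGN - 1u)) == 0u);
        return addr;
    }
    ```

    On the target, assuming the CMSIS headers, the returned address would be stored with something like `SCB->VTOR = (uint32_t)relocate_vectors((const uint32_t *)SCB->VTOR);` while interrupts are disabled. Placing the ISR code itself in RAM is toolchain-specific: it usually needs a dedicated section in the linker script plus a section attribute on the handler.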

    There is a sneakier factor that may also be biting you: if the code being interrupted is also using Flash, the interrupt dispatch may have to wait for its Flash accesses to complete, assuming it misses cache. You can fix this by also moving the interrupted code into RAM, but at this point it's starting to sound like nothing lives in Flash. There's a way to keep the code in Flash that I mention below.

    Finally, there's an even sneakier effect: if you have other interrupts that may occur right before your latency-sensitive interrupt, it is possible for your latency-sensitive interrupt to get -5 cycles of latency (that is, to be entered early) due to tail chaining.

    My solution to the last two problems I listed is a little weird: make sure the processor is idle, i.e. not taking another interrupt or fetching from Flash, when your interrupt occurs. The way I did this is by configuring a lower-priority interrupt to arrive just before my latency-sensitive interrupt (using a timer); that ISR simply executes a wait-for-interrupt instruction, wfi.
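
    The idle-before-interrupt trick could be sketched as follows. This is an illustration under stated assumptions, not the code from m4vgalib: the names `shim_compare`, `wait_for_interrupt` and `shim_irq_handler` are hypothetical, as is the idea of deriving the early interrupt from a compare value on the same free-running counter. The lead time must be longer than the worst-case time the processor needs to enter the early ISR and reach the wfi.

    ```c
    #include <stdint.h>

    /* Counter value at which the low-priority "shim" interrupt should fire so
       that it arrives `lead` cycles before the latency-sensitive one.
       Assumes period > 0, critical_count < period, and a counter that wraps
       modulo `period`. */
    static uint32_t shim_compare(uint32_t critical_count, uint32_t period,
                                 uint32_t lead)
    {
        lead %= period;
        return (critical_count >= lead)
                   ? critical_count - lead
                   : critical_count + (period - lead);
    }

    /* Execute wfi on a Cortex-M4 build; compiles to nothing elsewhere so the
       sketch can also be built on a host machine. */
    static inline void wait_for_interrupt(void)
    {
    #if defined(__ARM_ARCH_7EM__)
        __asm volatile ("wfi");
    #endif
    }

    /* Hypothetical handler for the low-priority shim interrupt. On real
       hardware, clear the timer's interrupt flag before sleeping. */
    void shim_irq_handler(void)
    {
        /* Sleep here; the higher-priority, latency-sensitive interrupt then
           preempts an idle core rather than arbitrary code. */
        wait_for_interrupt();
    }
    ```

    For example, with a period of 1000 counts and the critical interrupt at count 0, a lead of 64 puts the shim interrupt at count 936. The shim must have a lower priority than the latency-sensitive interrupt (a numerically larger value on Cortex-M) so that the latter can preempt it.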

    These are surmountable problems. I disagree with the commenters that you need to abandon C and write in assembly language; my m4vgalib system contains almost no assembly language and has very low jitter.

    I discuss these very same problems and my solutions in more detail in one section of an article on my blog.