Tags: c, profiling, microcontroller, cortex-m, nxp-microcontroller

Different clock cycle values for the same (repeated) code


I'm trying to profile some algorithms on my NXP LPC11U37H board (ARM Cortex-M0), as I want to know how many clock cycles are required to execute specific algorithms.

I wrote these simple macros to do some profiling:

#define START_COUNT clock_cycles = 0;\
Chip_TIMER_Enable(LPC_TIMER32_1);\
Chip_TIMER_Reset(LPC_TIMER32_1);

#define STOP_COUNT Chip_TIMER_Disable(LPC_TIMER32_1);

#define GET_COUNT clock_cycles = Chip_TIMER_ReadCount(LPC_TIMER32_1);\
myprintf("%d\n\r", clock_cycles);

Basically, START_COUNT resets the clock_cycles variable and enables and resets the counter, which is configured to count at the same frequency as the microcontroller (48 MHz). STOP_COUNT stops the timer, while GET_COUNT reads the timer value and prints it over UART (myprintf() is just a loop which sends characters via the serial port).

When I want to profile some algorithms, I just do something like this:

START_COUNT;
algorithm();
STOP_COUNT;
GET_COUNT;

Everything works, but something seems to be off. For instance, I tried to profile this code:

START_COUNT;
for (volatile int i = 0; i < 1000; i++);
STOP_COUNT;
GET_COUNT;

START_COUNT;
for (volatile int i = 0; i < 1000; i++);
STOP_COUNT;
GET_COUNT;

START_COUNT;
for (volatile int i = 0; i < 1000; i++);
STOP_COUNT;
GET_COUNT;

And I got the following clock cycles values:

21076
19074
21074

That's strange, as the compiler is configured not to optimize anything (GCC -O0, in Debug mode). Therefore, I checked the assembly code of the three code blocks, and they are exactly the same (except for memory addresses, etc.). You can check it here: http://pastebin.com/raw/x6tbi3Mr - if you see some ISB/DSB instructions, it's because I was trying to fix this behavior, but it didn't work.

Moreover, I disabled all interrupts.

I wonder what's the problem. Is there something that I'm not considering?


Solution

  • Okay, I had some fun and made a simple example for you. First off, every year goes by and new developers come along who don't know who Michael Abrash is. The world has changed, yes: the tools are better, so is the hardware, and there are lots of folks who can tune things. But The Zen of Assembly Language is, IMO, still very relevant, particularly to this question.

    https://github.com/jagregory/abrash-zen-of-asm

    The 8088 was old news when the book came out, and performance tuning on it today is even less relevant. But if that is all you see in this book, you are missing out. I used what I learned there below, and I use it every day beating up on logic, chips, and boards... making them perform and/or making them break.

    And the point of this answer is not necessarily to show how to profile something, although it will (you are already profiling something). It is to show that this is not as straightforward as you would expect; there are factors other than just the C code you have written. The placement of the C code in flash, flash vs. RAM, wait states or not, prefetch if you have it, branch prediction if you have it, all make a big difference. I might even demonstrate that the same sequence of instructions with a different alignment changes the results. Be happy you don't have a cache on the Cortex-M0; that takes the chaos and squares it...

    I have NXP chips here somewhere, and I have at least one Cortex-M0+ nearby, but I chose a Cortex-M0 from ST, an STM32F030K6T6, as it was already wired up and ready to play with. It has a built-in 8 MHz oscillator and a PLL to multiply that, so first we play at 8 MHz, then at 48. It does not have the four different wait-state settings your chip has; it has two choices: less than or equal to 24 MHz, or greater than that (up to 48). But it does have a prefetch, which yours may not have.

    You might have a SysTick timer; it is something the chip vendor can choose to include or not. When present, it is always at the same address (on the Cortex-Ms so far).

    #define STK_CSR 0xE000E010  // SysTick control and status
    #define STK_RVR 0xE000E014  // SysTick reload value
    #define STK_CVR 0xE000E018  // SysTick current value
    #define STK_MASK 0x00FFFFFF // the counter is only 24 bits wide
        PUT32(STK_CSR,4);          // processor clock, counter disabled
        PUT32(STK_RVR,0xFFFFFFFF); // reload value (only 24 bits are used)
        PUT32(STK_CVR,0x00000000); // writing clears the current count
        PUT32(STK_CSR,5);          // enable, still on the processor clock
        //note: it counts down.
    

    PUT32 is an abstraction; long story, won't get into it here.
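
    Conceptually it is just a 32-bit store to a literal address; a minimal C sketch of the idea (an illustration and an assumption about the intent, not the code actually used here) would be:

    #include <stdint.h>

    // Store a 32-bit value at an absolute address; volatile keeps the
    // compiler from optimizing the access away or reordering it.
    static inline void PUT32(uint32_t addr, uint32_t val)
    {
        *(volatile uint32_t *)addr = val;
    }

    The assembly version actually used in this example is: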

    .thumb_func
    .globl PUT32
    PUT32:
        str r1,[r0]
        bx lr
    

    Now add a test function:

    .align 8
    .thumb_func
    .globl TEST
    TEST:
        ldr r3,[r0]
    test_loop:
        sub r1,#1
        bne test_loop
        ldr r2,[r0]
        sub r3,r2
        mov r0,r3
        bx lr
    

    The simplest version reads the timer, runs the loop the passed-in number of times, then reads the timer again and subtracts to get the delta, and returns that. Shortly we will add NOPs between the top of the loop and the subtract.
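
    As an aside, and for comparison with the approach in the question, a rough C sketch of the same read/loop/read/subtract pattern might look like the following (hypothetical: test_c and STK_CVR_ADDR are names made up for the sketch, the address mirrors the STK_CVR register above, and the compiler rather than the assembler decides what the loop body becomes, so its numbers will not match the hand-written assembly):

    #include <stdint.h>

    #define STK_CVR_ADDR 0xE000E018u    // SysTick current value register
    #define STK_MASK     0x00FFFFFFu    // the counter is 24 bits wide

    // Time 'count' iterations of an empty busy loop. SysTick counts DOWN,
    // so start minus end is the elapsed number of clocks.
    static uint32_t test_c(uint32_t count)
    {
        volatile uint32_t *cvr = (volatile uint32_t *)STK_CVR_ADDR;
        uint32_t start = *cvr;
        for (volatile uint32_t i = count; i != 0; i--)
            ;                           // empty timed loop
        uint32_t end = *cvr;
        return (start - end) & STK_MASK;
    }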

    With the .align I am forcing the alignment of the start of the assembly function:

    08000100 <TEST>:
     8000100:   6803        ldr r3, [r0, #0]
    
    08000102 <test_loop>:
     8000102:   3901        subs    r1, #1
     8000104:   d1fd        bne.n   8000102 <test_loop>
     8000106:   6802        ldr r2, [r0, #0]
     8000108:   1a9b        subs    r3, r3, r2
     800010a:   1c18        adds    r0, r3, #0
     800010c:   4770        bx  lr
     800010e:   46c0        nop         ; (mov r8, r8)
     8000110:   46c0        nop         ; (mov r8, r8)
     8000112:   46c0        nop         ; (mov r8, r8)
    

    BTW, thanks for asking this question; I hadn't realized my example code for this chip wasn't setting the flash wait states right for 48 MHz...

    So at 8 MHz I can play with the four combinations: the fast and slow flash settings, each with and without the prefetch enabled.

    PUT32(FLASH_ACR,0x00);  // zero wait states, prefetch off
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    PUT32(FLASH_ACR,0x10);  // zero wait states, prefetch on
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    PUT32(FLASH_ACR,0x01);  // one wait state, prefetch off
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    PUT32(FLASH_ACR,0x11);  // one wait state, prefetch on
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    ra=TEST(STK_CVR,1000);
    hexstring(ra);
    

    So, the TEST function as written above, using the 8 MHz internal oscillator with no PLL:

    00000FA0
    00000FA0
    00000FA0
    00000FA0
    00001B56
    00001B56
    00000FA2
    00000FA2
    

    And then add more NOPs in the test loop:

    add one nop
    00001388
    00001388
    00001388
    00001388
    00001F3F
    00001F3F
    00001389
    00001389
    
    two nops
    
    00001770
    00001770
    00001770
    00001770
    0000270E
    0000270E
    00001B57
    00001B57
    
    three nops
    
    00001B58
    00001B58
    00001B58
    00001B58
    00002AF7
    00002AF7
    00002133
    00002133
    
    eight nops
    
    00002EE0
    00002EE0
    00002EE0
    00002EE0
    00004A36
    00004A36
    000036AE
    000036AE
    
    9
    
    000032C8
    000032C8
    000032C8
    000032C8
    00004E1F
    00004E1F
    00003A96
    00003A96
    
    10
    
    000036B0
    000036B0
    000036B0
    000036B0
    000055EE
    000055EE
    00003E7E
    00003E7E
    
    11
    
    
    00003A98
    00003A98
    00003A98
    00003A98
    000059D7
    000059D7
    00004266
    00004266
    
    
    12
    
    00003E80
    00003E80
    00003E80
    00003E80
    000061A6
    000061A6
    0000464E
    0000464E
    
    16
    
    00004E20
    00004E20
    00004E20
    00004E20
    00007916
    00007916
    000055EE
    000055EE
    
    no wait state speeds (clocks, by number of NOPs in the loop)
    
    0x0FA0 = 4000  0
    0x1388 = 5000  1
    0x1770 = 6000  2
    0x1B58 = 7000  3
    
    0x2EE0 = 12000 8
    
    0x4E20 = 20000 16
    
    
    slow flash times (clocks, by number of NOPs in the loop)
    
    0x1B56 = 6998   0
    0x1F3F = 7999   1
    0x270E = 9998   2
    0x2AF7 = 10999  3
    0x4A36 = 18998  8
    0x4E1F = 19999  9
    0x55EE = 21998  10
    0x59D7 = 22999  11
    0x61A6 = 24998  12
    
    0x7916 = 30998  16
    

    So for this chip, the no-wait-state case with or without prefetch is the same, and is linear as far as I tested: add a NOP and you add 1000 clocks. Now, why does the no-NOP loop, which is just a subtract and a branch-if-not-equal, take 4 clocks per iteration and not 2? That could be the pipeline, or it could be the AMBA/AXI bus; long gone are the days where a CPU bus is just an address and some strobes (okay, the Wishbone designs on OpenCores). You can download the AMBA/AXI documentation from ARM's website to see what is going on there. So this could be the pipeline or a side effect of the bus; my guess is the pipeline.

    Now, the slow flash setting is the most interesting so far. The no-NOP loop is basically 7000 clocks instead of 4000, so it feels like there are 3 more wait-state clocks in there per loop. Each NOP gives us 1000 more clocks, so that is okay... until we go from 9 to 10 NOPs, which costs us 2000, and then from 11 to 12 is another 2000. So unlike the no-wait-state version this is non-linear. Is it because the prefetching of instructions pushes across a boundary?

    So I take a detour here: between the TEST label and the load of the timestamp into r3, I add a NOP, which should also push the alignment of the back end of the loop. But that doesn't change the time for the 8-NOPs-in-the-loop case. Adding a second NOP up front to push the alignment further also does not change the time. So much for that theory.

    Switching to 48 MHz:

    slow, no prefetch
    00001B56
    00001B56
    slow, with prefetch
    00000FA0
    00000FA2
    
    9 nops
    
    00004E1F
    00004E1F
    00003A96
    00003A96
    
    10 nops
    
    000055EE
    000055EE
    00003E7E
    00003E7E
    

    No real surprise there. I shouldn't run with the fast flash setting at this speed, so this is the slow setting with and without prefetch. The speeds are the same relative to the timer, which is based on the clock the whole chip runs at, and we see the same interesting non-linear step in the performance. Remember/understand: even though it is the same number of clock cycles, in this case the clock is 6 times faster, so this code is running 6 times faster than at 8 MHz. Should be obvious, but don't forget to factor that into profiling.

    I guess the interesting thing is that with prefetch enabled we are getting the 0xFA0 number. Understand that prefetching sometimes helps and sometimes hurts; it is probably not too hard to create a benchmark that demonstrates it helping, not helping, or not helping in a linear way. We don't know how this hardware works, but if the prefetch is, say, 4 words, the first word costs the full wait states while the next three come in at one wait state each. But what if my code is doing some hopping thing like this:

    b one
    nop
    nop
    nop
    one:
    b two
    nop
    nop
    nop
    two:
    

    and so on. Not knowing how the hardware works, each of those branch destinations might be 6 clocks to fetch with prefetch when they might only be 3 clocks without, who knows... Like a cache, there is the time penalty of the extra stuff you read and don't use. Do the cache hits outweigh the stuff read and not used? Same here: do the prefetch timing gains outweigh the prefetched stuff not used?

    Last thing before leaving you to it: if I take the code with zero NOPs, and there are many ways to do this, but if I just brute-force it into SRAM in self-modifying-code fashion (or bootloader fashion, if you will) and then branch to it:

        ra=0x20000800;
        PUT16(ra,0x6803); ra+=2;    // ldr  r3,[r0]
        PUT16(ra,0x3901); ra+=2;    // subs r1,#1
        PUT16(ra,0xd1fd); ra+=2;    // bne  <loop>
        PUT16(ra,0x6802); ra+=2;    // ldr  r2,[r0]
        PUT16(ra,0x1a9b); ra+=2;    // subs r3,r3,r2
        PUT16(ra,0x1c18); ra+=2;    // adds r0,r3,#0
        PUT16(ra,0x4770); ra+=2;    // bx   lr
        PUT16(ra,0x46c0); ra+=2;    // nop
        PUT16(ra,0x46c0); ra+=2;    // nop
        PUT16(ra,0x46c0); ra+=2;    // nop
        PUT16(ra,0x46c0); ra+=2;    // nop
        PUT16(ra,0x46c0); ra+=2;    // nop
        PUT16(ra,0x46c0); ra+=2;    // nop
    
        ra=branchto(STK_CVR,1000,0x20000801);   // bit 0 set: stay in Thumb state
        hexstring(ra);
        ra=branchto(STK_CVR,1000,0x20000801);
        hexstring(ra);
    
    .thumb_func
    .globl branchto
    branchto:
        bx r2   @ r0/r1 still hold the arguments; jump to the code copied into SRAM
    
    00000FA2
    00000FA0 
    

    This is at 48 MHz, BTW. I get the 0xFA0 number we would see with no wait states and/or with prefetch on. I didn't try any more experiments after this, but I suspect running from RAM will not have any cliffs in the performance; it will be linear for a simple test like this one. It will be your best performance, but you usually don't have a lot of RAM relative to flash.
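
    For reference, a more conventional way to get that code into SRAM from C is to copy the same halfwords with memcpy and call them through a function pointer with the Thumb bit set. This is only a sketch of the idea, not what I ran above; test_code, sram_buf and run_from_sram are names invented for the illustration:

    #include <stdint.h>
    #include <string.h>

    // Machine code of the zero-NOP TEST loop, the same halfwords written
    // with PUT16 above (taken from the earlier disassembly).
    static const uint16_t test_code[] = {
        0x6803,     // ldr  r3,[r0]
        0x3901,     // subs r1,#1
        0xd1fd,     // bne  <loop>
        0x6802,     // ldr  r2,[r0]
        0x1a9b,     // subs r3,r3,r2
        0x1c18,     // adds r0,r3,#0
        0x4770,     // bx   lr
    };

    typedef uint32_t (*test_fn)(uint32_t timer_addr, uint32_t count);

    uint32_t run_from_sram(void)
    {
        static uint16_t sram_buf[16];               // lands in RAM (.bss)
        memcpy(sram_buf, test_code, sizeof test_code);
        // Set bit 0 so the call stays in Thumb state; there are no caches
        // on a Cortex-M0, so no cache maintenance is needed before the call.
        test_fn fn = (test_fn)((uintptr_t)sram_buf | 1u);
        return fn(0xE000E018u /* STK_CVR */, 1000u);
    }

    Linker-placed RAM functions are the cleaner route in a real project, but this mirrors the brute-force experiment.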

    When you have chips like the one you have, you also have to watch the relative clocks. In this case, for example, at 8 MHz we have a loop taking 0xFA0 or 4000 clocks: 500 us. At 48 MHz we start at 146 us and get down to 83 us. But that same 4000-clock loop with no prefetch at 24 MHz is expected to take 167 us, and at 25 MHz, 280 us without prefetch: a faster clock but noticeably slower performance, because we had to add those wait states. With your chip and its four different wait-state settings (or any of these microcontrollers whose flash cannot run the full range of speeds without wait states), when you are at the top clock rate of one wait-state setting and then go just over the edge to the slowest clock of the next setting, there is a performance hit. Ideally, to push performance (and not caring about power consumption and other issues), you want to be running at the max clock speed for the target wait-state setting.
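
    A quick sanity check of that arithmetic (time = cycles / clock frequency), as a standalone host-side sketch rather than code for the board:

    #include <stdio.h>

    int main(void)
    {
        const double fast_loop = 4000.0;    // no-wait-state loop, in clocks
        const double slow_loop = 7000.0;    // same loop with the added wait states
        printf("%.0f us\n", fast_loop / 8.0e6  * 1.0e6);    // 500 us at 8 MHz
        printf("%.0f us\n", slow_loop / 48.0e6 * 1.0e6);    // ~146 us at 48 MHz
        printf("%.0f us\n", fast_loop / 48.0e6 * 1.0e6);    // ~83 us at 48 MHz
        printf("%.0f us\n", fast_loop / 24.0e6 * 1.0e6);    // ~167 us at 24 MHz
        printf("%.0f us\n", slow_loop / 25.0e6 * 1.0e6);    // 280 us at 25 MHz
        return 0;
    }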

    These Cortex-M0s are about as simple as it gets. When you move to, say, a Cortex-M4, with caches, a much wider clock range, an MPU I think, and other things, profiling becomes difficult to impossible: move the same instructions around in memory and your performance can change anywhere from not at all to 10 or 20 percent either way. Change one line of code at the high level, or add a single instruction within the code, and again you can see anywhere from little to large changes in performance. Which means you cannot tune for this; you can't just say these 100 lines of code go this fast and then modify the code around them and assume they will continue to go this fast. Putting them in a function doesn't help: that function moves around as well when you add or remove things in the rest of the program, changing its performance. At best you have to do what I have demonstrated here and take more control over exactly where that code is placed, so that the function is always in the same place. And that still doesn't give you repeatable performance on platforms with caches, as what happens between each call to that function affects what is and isn't in the caches, and how that function performs as a result.

    And this is assembly code, not compiled C, that I tested. The compiler adds yet another wrinkle to this. Some folks assume that the same C code always produces the same machine code. Certainly not true; first off, optimization. Also understand that one compiler vs. another does not generate the same code, or at least you can't assume that: gcc vs. llvm/clang, for example. Likewise different versions of the same compiler, gcc 3.x, 4.x and so on; in the case of gcc, even the point releases sometimes vary widely in performance with everything else held constant (same source and same build commands), and it is not true that newer versions produce faster code; gcc has not followed that trend, and generic compilers don't work especially well for any particular platform. The things they add from one version to the next are not all about the performance of the output. Gcc being distributed as source with lots of build knobs, you can make multiple builds of the same version of gcc with different build options, and I would bet you could end up with different results from things built by those two compilers that report the same version, all other things held equal.

    With experience it sometimes becomes quite easy to take the same code and make its performance change on the same hardware, or to make tiny modifications that you wouldn't think would make a difference but do. Or, if you have access to the logic, you can create programs that perform the same task with significantly different execution times. It all starts with a book like the Zen of Assembly Language, or some others, to open your eyes to these simple things; fast forward 20 years and there are dozens of hardware performance gizmos, each one of them helping sometimes and hurting other times. As Abrash says, in so many words: sometimes you have to try something crazy and time it just to see; you may end up with something that performs much better.

    So I don't know what your goal was with this microcontroller, but you will need to continue to re-profile your code as you go; do not assume the first measurement is the final answer. Every time you change anything, from any line of source code to the compiler options or version, there is a chance the performance will change significantly. Allow for a wide margin in your design, and/or test and tune every version.

    What you are seeing is not necessarily a surprise. Again, as with Abrash, it could also just be how you are using that timer... Understand your tools and make sure your timer works the way you expect. Or it could be something else.
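
    One quick sanity check along those lines, sketched with the macros from your own question (the iteration counts are arbitrary, picked just for the illustration): time an empty region to learn the start/stop overhead, then time loops of two known lengths and check that the counts scale the way you expect.

    START_COUNT;
    STOP_COUNT;
    GET_COUNT;      // overhead of start/stop alone, no work in between

    START_COUNT;
    for (volatile int i = 0; i < 100; i++);
    STOP_COUNT;
    GET_COUNT;

    START_COUNT;
    for (volatile int i = 0; i < 200; i++);
    STOP_COUNT;
    GET_COUNT;      // should be roughly twice the 100-iteration count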