
Number of instructions used ARMv7


I am trying to figure out how many CPU cycles will be used to execute this delay function:

delay:
 subs r0, #1
 bmi end_delay
 b delay
 end_delay:
 bx lr

I feel intuitively that 1 CPU cycle should be used for each instruction, so if we began with r0 = 4 it would take 11 CPU cycles to complete the above code. Is that correct?


Solution

  • The Cortex-M is not the same as a Microchip PIC (or a Z80 and some others): you cannot create a predictable delay this way with this instruction set. You can ensure it will take at least some amount of time (clocks) OR SLOWER, but not exactly some amount of time. For comparison, the loop can be written more tightly as:

    0000009c <hello>:
      9c:   3801        subs    r0, #1
      9e:   d1fd        bne.n   9c <hello>
    

    Your loop as written has an extra branch decision in it; more instructions and more paths mean more opportunity for the execution time to vary:

    00000090 <delay>:
      90:   3801        subs    r0, #1
      92:   d400        bmi.n   96 <end_delay>
      94:   e7fc        b.n 90 <delay>
    
    00000096 <end_delay>:
    

    So let's focus on those three instructions.
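    Before worrying about clocks, it is worth nailing down how many instructions even execute. A quick C model of the loop's semantics (my own sketch, not part of the code under test) shows that with r0 = 4 the dynamic instruction count is 15, not 11, because subs runs until the result goes negative:

    ```c
    #include <assert.h>

    /* model the delay/bmi/b loop: count the instructions that
       dynamically execute for a given starting r0 */
    static int executed_instructions(int r0)
    {
        int n = 0;
        for (;;) {
            r0 -= 1; n++;        /* subs r0, #1   */
            n++;                 /* bmi end_delay */
            if (r0 < 0) break;   /* branch taken  */
            n++;                 /* b delay       */
        }
        return n + 1;            /* bx lr         */
    }
    ```

    With r0 = 4 the subs executes 5 times (4, 3, 2, 1, 0 then -1), so even at one clock per instruction the count would be 15, and as shown below one clock per instruction is not a safe assumption anyway.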

    Some Cortex-Ms have a build-time (of the logic) option of fetching per instruction or per word; the Cortex-M4 documentation says:

    All fetches are word-wide.

    So we hope that halfword alignment won't affect performance; with these instructions we don't necessarily expect to see a difference anyway. With a full-sized ARM the fetches are multiple words, so you will definitely see fetch-line (size) effects.

    The execution time depends heavily on the implementation. The Cortex-M is just the ARM core; the rest of the chip is from the chip vendor: purchased IP, built in house, or a combination (very likely a combination). ARM does not make chips (other than perhaps for validation); they make IP that they sell.

    The chip vendor determines the flash (and RAM) implementation. Often with these types of chips the flash speed is at or below the CPU speed, meaning it can take two clocks to fetch one instruction, so you never feed the CPU as fast as it can go. Some vendors, like ST, put in a cache that you cannot (so far as I know) turn off, so it is hard to see this effect (but still possible). The particular chip I am using for this says:

    8.2.3.1 Prefetch Buffer The Flash memory controller has a prefetch buffer that is automatically used when the CPU frequency is greater than 40 MHz. In this mode, the Flash memory operates at half of the system clock. The prefetch buffer fetches two 32-bit words per clock allowing instructions to be fetched with no wait states while code is executing linearly. The fetch buffer includes a branch speculation mechanism that recognizes a branch and avoids extra wait states by not reading the next word pair. Also, short loop branches often stay in the buffer. As a result, some branches can be executed with no wait states. Other branches incur a single wait state.

    And of course, like ST, they don't really tell you the whole story, so we just go in and try it. You can use the debug timers if you want, but the SysTick runs off the same clock and gives you the same result.
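    The SysTick measurement arithmetic (the STCURRENT/STMASK macros in the harness further down) boils down to subtracting two reads of a down-counter, masked to its width. A minimal sketch, assuming the standard 24-bit SysTick current-value register, with the two reads passed in as arguments rather than read from hardware:

    ```c
    #include <stdint.h>

    #define STMASK 0x00FFFFFFu  /* SysTick current value is a 24-bit down-counter */

    /* clocks elapsed between two counter reads; because the counter
       counts DOWN the delta is start - stop, and the mask makes a
       single wrap of the counter come out right */
    static uint32_t systick_delta(uint32_t start, uint32_t stop)
    {
        return (start - stop) & STMASK;
    }
    ```

    On real hardware the two values would come from the SysTick current value register (SYST_CVR, 0xE000E018 on ARMv7-M), and this only works if the code under test runs for less than one full wrap of the counter.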

    00000086 <test>:
      86:   f3bf 8f4f   dsb sy
      8a:   f3bf 8f6f   isb sy
      8e:   680a        ldr r2, [r1, #0]
    
    00000090 <delay>:
      90:   3801        subs    r0, #1
      92:   d400        bmi.n   96 <end_delay>
      94:   e7fc        b.n 90 <delay>
    
    00000096 <end_delay>:
      96:   680b        ldr r3, [r1, #0]
      98:   1ad0        subs    r0, r2, r3
      9a:   4770        bx  lr
    

    So I read the CCR and CPUID

    00000200 CCR
    410FC241 CPUID
    

    just because. Then I ran the code under test three times:

    00000015
    00000015
    00000015
    

    These numbers are in hex, so that is 21 clocks, with the same execution time each run, so no cache or branch-prediction effects. I didn't see anything related to branch prediction for the Cortex-M4; other Cortex-Ms do have branch prediction (maybe only the M7). I have the I and D caches off; they will of course, along with alignment, greatly affect the execution time (and that time can/will vary as your application runs).

    I changed the alignment (by adding or removing nops in front of this code):

    0000008a <delay>:
      8a:   3801        subs    r0, #1
      8c:   d400        bmi.n   90 <end_delay>
      8e:   e7fc        b.n 8a <delay>
    

    and it didn't affect the execution time.

    AFAIK with this processor you cannot change the flash wait-state settings directly; they are automatic based on the clock settings. So running at a different clock speed, above the 40MHz mark, I get

    0000001E                                                                                         
    0000001E                                                                                         
    0000001E 
    

    For the same machine code and alignment: 30 clocks now instead of 21.

    Normally the RAM is faster, with no wait states (understand that these busses take several clocks per transaction, so it is not like the old days, but there is still a delay you can detect). So running these instructions from RAM should tell us something:

    for(rb=0;rb<0x20;rb+=2)
    {

        hexstrings(rb);                        // alignment offset being tried
        ra=0x20001000+rb;
        PUT16(ra,0x680a); ra+=2;               // ldr r2,[r1]  (start count)
        hexstrings(ra);                        // address of the delay loop
        PUT16(ra,0x3801); ra+=2;               // subs r0,#1
        PUT16(ra,0xd400); ra+=2;               // bmi end_delay
        PUT16(ra,0xe7fc); ra+=2;               // b delay
        PUT16(ra,0x680b); ra+=2;               // ldr r3,[r1]  (end count)
        PUT16(ra,0x1ad0); ra+=2;               // subs r0,r2,r3
        PUT16(ra,0x4770); ra+=2;               // bx lr

        PUT16(ra,0x46c0); ra+=2;               // nop padding
        PUT16(ra,0x46c0); ra+=2;
        PUT16(ra,0x46c0); ra+=2;
        PUT16(ra,0x46c0); ra+=2;
        PUT16(ra,0x46c0); ra+=2;
        PUT16(ra,0x46c0); ra+=2;
        // call it (thumb bit set in the address), presumably with
        // r0=4 and r1 pointing at the systick current-value register
        hexstring(BRANCHTO(4,STCURRENT,0x20001001+rb)&STMASK);
    }
    

    and that certainly gets interesting...

    00000000 20001002 00000026                                                                       
    00000002 20001004 00000020                                                                       
    00000004 20001006 00000026                                                                       
    00000006 20001008 00000020                                                                       
    00000008 2000100A 00000026                                                                       
    0000000A 2000100C 00000020                                                                       
    0000000C 2000100E 00000026                                                                       
    0000000E 20001010 00000020                                                                       
    00000010 20001012 00000026                                                                       
    00000012 20001014 00000020                                                                       
    00000014 20001016 00000026                                                                       
    00000016 20001018 00000020                                                                       
    00000018 2000101A 00000026                                                                       
    0000001A 2000101C 00000020                                                                       
    0000001C 2000101E 00000026                                                                       
    0000001E 20001020 00000020 
    

    First off, it is now 32 or 38 clocks; second, there is an alignment effect.

    The ARMv7-M CCR shows a branch-prediction bit, but the TRM and the vendor documentation don't show it, so it could be a generic thing that not all cores support.

    So for this specific Cortex-M4 chip, the time to execute your loop is between 21 and 38 clocks, and I could probably make it slower if I wanted to. I don't think I could get it down to 11 on this chip, though.

    If you are, for example, doing I2C bit-banging, you can use something like this for a delay; it won't be optimal, but it will work just fine. If you need something more precise within a window of time (at least this much but not greater than that), then use a timer (and understand that, polled or interrupt-driven, your accuracy will have some error). If the timer peripheral or something else can generate the signal you want, you can then get down to a clock-accurate waveform (if that is what your delay is for).
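    A hypothetical C equivalent of such a good-enough delay (my names, not from the question; the volatile is what keeps the optimizer from deleting the loop):

    ```c
    #include <stdint.h>

    /* burn at least n loop iterations; the wall-clock time is a lower
       bound only -- wait states, caches and interrupts can only
       stretch it, exactly as measured above */
    static uint32_t spin_at_least(uint32_t n)
    {
        volatile uint32_t i = 0;
        while (i < n)
            i++;
        return i;   /* always equals n; returned only so the sketch is checkable */
    }
    ```

    This is the software equivalent of the assembly loop under discussion: fine for "wait at least this long", useless for "wait exactly this long".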

    Another Cortex-M4 is expected to have different results; I would expect an STM32's SRAM to be the same speed as or faster than its flash, not slower as in this case. And there are settings your init code can mess with (or someone else's, if you are relying on someone else to set up your chip) that can/will affect execution time.

    EDIT

    I don't know where I got the idea this was for a Cortex-M4, which is an ARMv7-M part. I didn't have a Raspberry Pi 2 handy, but had a Pi 3, running in aarch32 mode, 32-bit instructions. I had no idea how much work it would be to get the timers running and then the cache enabled. The Pi runs out of DRAM, which is very inconsistent even bare metal, so I figured I would enable the L1 cache; after the first run it should be all in cache and consistent. Now that I think about it, there are four cores and each is running; I don't know how to disable them. The other three are spinning in a loop waiting for a mailbox register to tell them what code to run. Perhaps I need to have them branch somewhere and run out of L1 cache as well... not sure if the L1 is per core or shared; I think I looked that up at one point.

    Anyway, the code under test:

    000080c8 <COUNTER>:
        80c8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
    
    000080cc <delay>:
        80cc:   e2500001    subs    r0, r0, #1
        80d0:   4a000000    bmi 80d8 <end_delay>
        80d4:   eafffffc    b   80cc <delay>
    
    000080d8 <end_delay>:
        80d8:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
        80dc:   e0430002    sub r0, r3, r2
        80e0:   e12fff1e    bx  lr
    

    And the punch line, for that alignment: the first column is the r0 passed in, the next three are three runs, and the last column (where present) is the delta from the prior row to the current one (the cost of one extra count in r0).

    00000000 0000000A 0000000A 0000000A 
    00000001 00000014 00000014 00000014 0000000A 
    00000002 0000001E 0000001E 0000001E 0000000A 
    00000003 00000028 00000028 00000028 0000000A 
    00000004 00000032 00000032 00000032 0000000A 
    00000005 0000003C 0000003C 0000003C 0000000A 
    00000006 00000046 00000046 00000046 0000000A 
    00000007 00000050 00000050 00000050 0000000A 
    00000008 0000005A 0000005A 0000005A 0000000A 
    00000009 00000064 00000064 00000064 0000000A 
    0000000A 0000006E 0000006E 0000006E 0000000A 
    0000000B 00000078 00000078 00000078 0000000A 
    0000000C 00000082 00000082 00000082 0000000A 
    0000000D 0000008C 0000008C 0000008C 0000000A 
    0000000E 00000096 00000096 00000096 0000000A 
    0000000F 000000A0 000000A0 000000A0 0000000A 
    00000010 000000AA 000000AA 000000AA 0000000A 
    00000011 000000B4 000000B4 000000B4 0000000A 
    00000012 000000BE 000000BE 000000BE 0000000A 
    00000013 000000C8 000000C8 000000C8 0000000A 
    

    Then, to make alignment checking easier (which I didn't need to do in the end), I had it try different alignments of the above code, printed as address/result pairs for an r0 of four:

    00010000 00000032 00010004 0000002D 00010008 00000032 0001000C 0000002D

    This pattern repeats up to address 0x101FC.

    If I change the alignment in the compiled test

    000080cc <COUNTER>:
        80cc:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
    
    000080d0 <delay>:
        80d0:   e2500001    subs    r0, r0, #1
        80d4:   4a000000    bmi 80dc <end_delay>
        80d8:   eafffffc    b   80d0 <delay>
    
    000080dc <end_delay>:
        80dc:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
        80e0:   e0430002    sub r0, r3, r2
        80e4:   e12fff1e    bx  lr
    

    then it is a wee bit faster.

    00000000 00000009 00000009 00000009 
    00000001 00000012 00000012 00000012 00000009 
    00000002 0000001B 0000001B 0000001B 00000009 
    00000003 00000024 00000024 00000024 00000009 
    00000004 0000002D 0000002D 0000002D 00000009 
    00000005 00000036 00000036 00000036 00000009 
    00000006 0000003F 0000003F 0000003F 00000009 
    00000007 00000048 00000048 00000048 00000009 
    00000008 00000051 00000051 00000051 00000009 
    00000009 0000005A 0000005A 0000005A 00000009 
    0000000A 00000063 00000063 00000063 00000009 
    0000000B 0000006C 0000006C 0000006C 00000009 
    0000000C 00000075 00000075 00000075 00000009 
    0000000D 0000007E 0000007E 0000007E 00000009 
    0000000E 00000087 00000087 00000087 00000009 
    0000000F 00000090 00000090 00000090 00000009 
    00000010 00000099 00000099 00000099 00000009 
    00000011 000000A2 000000A2 000000A2 00000009 
    00000012 000000AB 000000AB 000000AB 00000009 
    00000013 000000B4 000000B4 000000B4 00000009 
    

    If I change it to be a function call:

    000080cc <COUNTER>:
        80cc:   e92d4001    push    {r0, lr}
        80d0:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
        80d4:   eb000003    bl  80e8 <delay>
        80d8:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
        80dc:   e8bd4001    pop {r0, lr}
        80e0:   e0430002    sub r0, r3, r2
        80e4:   e12fff1e    bx  lr
    
    000080e8 <delay>:
        80e8:   e2500001    subs    r0, r0, #1
        80ec:   4a000000    bmi 80f4 <end_delay>
        80f0:   eafffffc    b   80e8 <delay>
    
    000080f4 <end_delay>:
        80f4:   e12fff1e    bx  lr
    
    00000000 0000001A 0000001A 0000001A 
    00000001 00000023 00000023 00000023 00000009 
    00000002 0000002C 0000002C 0000002C 00000009 
    00000003 00000035 00000035 00000035 00000009 
    00000004 0000003E 0000003E 0000003E 00000009 
    00000005 00000047 00000047 00000047 00000009 
    00000006 00000050 00000050 00000050 00000009 
    00000007 00000059 00000059 00000059 00000009 
    00000008 00000062 00000062 00000062 00000009 
    00000009 0000006B 0000006B 0000006B 00000009 
    0000000A 00000074 00000074 00000074 00000009 
    0000000B 0000007D 0000007D 0000007D 00000009 
    0000000C 00000086 00000086 00000086 00000009 
    0000000D 0000008F 0000008F 0000008F 00000009 
    0000000E 00000098 00000098 00000098 00000009 
    0000000F 000000A1 000000A1 000000A1 00000009 
    00000010 000000AA 000000AA 000000AA 00000009 
    00000011 000000B3 000000B3 000000B3 00000009 
    00000012 000000BC 000000BC 000000BC 00000009 
    00000013 000000C5 000000C5 000000C5 00000009 
    

    The cost per count is the same, but the call overhead is more expensive.
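    Reading that off the numbers: the called version fits 0x1A + 9*r0 clocks, versus 9*(r0+1) for the inline loop at the same alignment, i.e. the same 9-clock slope with roughly 17 clocks of push/bl/pop/bx overhead added. A sketch (my labels, same data):

    ```c
    #include <stdint.h>

    /* straight-line fits to the two measured tables above */
    static uint32_t inline_clocks(uint32_t r0) { return 9u * (r0 + 1); }
    static uint32_t called_clocks(uint32_t r0) { return 0x1Au + 9u * r0; }
    ```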

    This also allows me to call a thumb-mode version, just for fun; by setting up the bx by hand to avoid the mode-change veneer the linker added, I made it a little faster (and consistent).

    000080cc <COUNTER>:
        80cc:   e92d4001    push    {r0, lr}
        80d0:   e59f103c    ldr r1, [pc, #60]   ; 8114 <edel+0x2>
        80d4:   e59fe03c    ldr lr, [pc, #60]   ; 8118 <edel+0x6>
        80d8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
        80dc:   e12fff11    bx  r1
    
    000080e0 <here>:
        80e0:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
        80e4:   e8bd4001    pop {r0, lr}
        80e8:   e0430002    sub r0, r3, r2
        80ec:   e12fff1e    bx  lr
    
    000080f0 <delay>:
        80f0:   e2500001    subs    r0, r0, #1
        80f4:   4a000000    bmi 80fc <end_delay>
        80f8:   eafffffc    b   80f0 <delay>
    
    000080fc <end_delay>:
        80fc:   e12fff1e    bx  lr
        8100:   e1a00000    nop         ; (mov r0, r0)
        8104:   e1a00000    nop         ; (mov r0, r0)
        8108:   e1a00000    nop         ; (mov r0, r0)
    
    0000810c <del>:
        810c:   3801        subs    r0, #1
        810e:   d400        bmi.n   8112 <edel>
        8110:   e7fc        b.n 810c <del>
    
    00008112 <edel>:
        8112:   4770        bx  lr
    
    00000000 000000F4 0000001B 0000001B 
    00000001 00000024 00000024 00000024 00000009 
    00000002 0000002D 0000002D 0000002D 00000009 
    00000003 00000036 00000036 00000036 00000009 
    00000004 0000003F 0000003F 0000003F 00000009 
    00000005 00000048 00000048 00000048 00000009 
    00000006 00000051 00000051 00000051 00000009 
    00000007 0000005A 0000005A 0000005A 00000009 
    00000008 00000063 00000063 00000063 00000009 
    00000009 0000006C 0000006C 0000006C 00000009 
    0000000A 00000075 00000075 00000075 00000009 
    0000000B 0000007E 0000007E 0000007E 00000009 
    0000000C 00000087 00000087 00000087 00000009 
    0000000D 00000090 00000090 00000090 00000009 
    0000000E 00000099 00000099 00000099 00000009 
    0000000F 000000A2 000000A2 000000A2 00000009 
    00000010 000000AB 000000AB 000000AB 00000009 
    00000011 000000B4 000000B4 000000B4 00000009 
    00000012 000000BD 000000BD 000000BD 00000009 
    00000013 000000C6 000000C6 000000C6 00000009
    

    With this alignment:

    0000810e <del>:
        810e:   3801        subs    r0, #1
        8110:   d400        bmi.n   8114 <edel>
        8112:   e7fc        b.n 810e <del>
    
    00008114 <edel>:
        8114:   4770        bx  lr
    
    
    00000000 0000007E 0000001C 0000001C 
    00000001 00000026 00000026 00000026 0000000A 
    00000002 00000030 00000030 00000030 0000000A 
    00000003 0000003A 0000003A 0000003A 0000000A 
    00000004 00000044 00000044 00000044 0000000A 
    00000005 0000004E 0000004E 0000004E 0000000A 
    00000006 00000058 00000058 00000058 0000000A 
    00000007 00000062 00000062 00000062 0000000A 
    00000008 0000006C 0000006C 0000006C 0000000A 
    00000009 00000076 00000076 00000076 0000000A 
    0000000A 00000080 00000080 00000080 0000000A 
    0000000B 0000008A 0000008A 0000008A 0000000A 
    0000000C 00000094 00000094 00000094 0000000A 
    0000000D 0000009E 0000009E 0000009E 0000000A 
    0000000E 000000A8 000000A8 000000A8 0000000A 
    0000000F 000000B2 000000B2 000000B2 0000000A 
    00000010 000000BC 000000BC 000000BC 0000000A 
    00000011 000000C6 000000C6 000000C6 0000000A 
    00000012 000000D0 000000D0 000000D0 0000000A 
    00000013 000000DA 000000DA 000000DA 0000000A 
    

    So, in some ideal world on this processor, assuming a cache hit on the delay code:

    00000004 00000032 00000032 00000032 0000000A 
    00000004 0000002D 0000002D 0000002D 00000009 
    00000004 0000003E 0000003E 0000003E 00000009 
    00000004 0000003F 0000003F 0000003F 00000009 
    00000004 00000044 00000044 00000044 0000000A 
    

    it is between 0x2D and 0x44 clocks to run that loop with r0 = 4.

    Realistically, on this platform with the cache disabled (or what you might see if you get a cache miss):

    00000000 0000030B 000002B7 000002ED 
    00000001 0000035B 00000389 000003E9 
    00000002 000003FB 00000439 0000041B 
    00000003 0000058F 000004E7 0000055B 
    00000004 000005FF 0000069D 000006D1 
    00000005 00000745 00000733 000006F7 
    00000006 00000883 00000817 00000801 
    00000007 00000873 00000853 0000089B 
    00000008 00000923 00000B05 0000092F 
    00000009 00000A3F 000009A9 00000B4D 
    0000000A 00000B79 00000BA9 00000C57 
    0000000B 00000C21 00000D13 00000B51 
    0000000C 00000C0B 00000E91 00000DE9 
    0000000D 00000D97 00000E0D 00000E81 
    0000000E 00000E5B 0000100B 00000F25 
    0000000F 00001097 00001095 00000F37 
    00000010 000010DB 000010FD 0000118B 
    00000011 00001071 0000114D 0000123F 
    00000012 000012CF 0000126D 000011DB 
    00000013 0000140D 0000143D 0000141B 
    000002B7 0000143D 
    

    The r0=4 line:

    00000004 000005FF 0000069D 000006D1 
    

    That's a lot of CPU counts...

    Hopefully I have put this topic to bed. While it is interesting to try to estimate how fast code runs or how many counts it takes, it is not that simple on these types of processors: pipelines, caches, branch prediction, complicated system busses, and a common-ish core used across various chip implementations where the chip vendor implements the memory/flash separately from the processor IP vendor's core.

    I didn't mess with branch prediction in this second experiment; had I done that, alignment would not be so consistent. Depending on how branch prediction is implemented, its usefulness can vary based on where the branch sits relative to the fetch line: whether the next fetch has started, or not, or is a certain way through, when the predictor determines it doesn't need that fetch and/or starts the branched-to fetch. In this case the branch target is only two instructions away, so you might not see it with this code; you would want some nops sprinkled in so that the bmi destination lands in a separate fetch line (in order to see the difference).

    And this is the easy stuff to manipulate: the same machine code sequence varying in execution time by, as we saw, between 0x3F and 0x6D1 clocks. That is over 27x between fastest and slowest, for the same machine code. Changing the alignment of the code by one instruction (say, unrelated code somewhere else having one more or one fewer instruction than a prior build) was a 5-count difference.

    To be fair, the mrc at the end of the test was probably part of the measured time. Two back-to-back reads:

    000080c8 <COUNTER>:
        80c8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
        80cc:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
        80d0:   e0430002    sub r0, r3, r2
        80d4:   e12fff1e    bx  lr
    

    This resulted in a count of 1 with either alignment. That doesn't guarantee the measurement error was only one count, but it likely wasn't a dozen.

    Anyway, I hope this helps your understanding.