Convert C-code to ARM Cortex M3 Assembler Code

I have the following C function:

int main_compare (int nbytes, char *pmem1, char *pmem2){
    for(nbytes--; nbytes>=0; nbytes--) {    
        if(*(pmem1+nbytes) - *(pmem2+nbytes) != 0) {
            return 0;
        }
    }
    return 1;
}

and I want to convert it into ARM Cortex-M3 assembler code. I'm not really good at this, and I don't have a suitable compiler to test whether I'm doing it right. Here is what I have so far:

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

    SUB R0, R0, #1    ; nBytes - 1 as maximal value for loop counter

_for_loop: 
    ADD R3, R1, R0    ;
    ADD R4, R2, R0    ; calculate pmem + n
    LDRB R3, [R3]     ;
    LDRB R4, [R4]     ; look at this address

    CMP R3, R4        ; if cmp = 0, then jump over return

    BE _next          ; if statement by "branch"-cmd
        MOV R0, #0    ; return value is zero
        BX LR         ; always return 0 here
_next:

    sub R0, R0, #1    ; loop counting
    BLPL _for_loop    ; pl = if positive or zero

    MOV R0, #1        ;
    BX LR             ; always return 1 here

ENDP

But I'm really not sure whether this is right, and I have no idea how to check it.

Solution

I see just 3 fairly simple problems there:

BE _next          ; if statement by "branch"-cmd
...
sub R0, R0, #1    ; loop counting
BLPL _for_loop    ; pl = if positive or zero

1. BEQ, not BE - condition codes are always 2 letters.
2. SUB alone won't update the flags - you need the S suffix to say so, i.e. SUBS.
3. BLPL would branch and link, thus overwriting your return address - you want BPL. Actually, BLPL wouldn't assemble here anyway, since in Thumb a conditional BL would need an IT to set it up (unless of course your assembler is clever enough to insert one automatically).
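
One thing those three fixes do not change is the shape of the loop, and it may be worth modelling that in C to see what it does. The sketch below is only my reading of the question's assembly with BEQ, SUBS and BPL substituted in; `asm_model_literal` is a hypothetical name, not something from the question. Because the bytes are loaded before the counter is ever tested, the model behaves like a do/while loop, so with nbytes == 0 it reads one byte before each pointer, which the C for loop never does.

```c
/* The C function from the question, unchanged. */
int main_compare(int nbytes, char *pmem1, char *pmem2) {
    for (nbytes--; nbytes >= 0; nbytes--) {
        if (*(pmem1 + nbytes) - *(pmem2 + nbytes) != 0) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical C model of the question's assembly after the three fixes.
   Assumption: this mirrors the instruction order one-to-one. */
int asm_model_literal(int nbytes, char *pmem1, char *pmem2) {
    nbytes = nbytes - 1;                                /* SUB  R0, R0, #1   */
    do {
        unsigned char a = (unsigned char)pmem1[nbytes]; /* ADD R3 + LDRB R3  */
        unsigned char b = (unsigned char)pmem2[nbytes]; /* ADD R4 + LDRB R4  */
        if (a != b) {                                   /* CMP, BEQ _next    */
            return 0;                                   /* MOV R0, #0; BX LR */
        }
        nbytes = nbytes - 1;                            /* SUBS R0, R0, #1   */
    } while (nbytes >= 0);                              /* BPL _for_loop     */
    return 1;                                           /* MOV R0, #1; BX LR */
}
```

To try the zero-length case without reading outside an array, pass pointers to the second element of two buffers whose first elements differ: `main_compare(0, a + 1, b + 1)` skips the loop entirely, while the model compares `a[0]` with `b[0]` first. The versions further down test the counter (SUBS, then BMI) before loading anything, so they do not share this behaviour.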

Edit: there's also, of course, a more general issue with the use of R4 in both the original code and my examples below. If you're interfacing with C code, R4's original value must be preserved across the function call and restored afterwards (R0-R3 are designated argument/scratch registers and can be freely modified). If you're in pure assembly, however, you don't necessarily need to follow a standard calling convention, so you can be more flexible.

Now, that's a very literal representation of the C code, and doesn't make best use of the instruction set - in particular the indexed addressing modes. One of the attractions of assembly programming is having complete control of the instructions, so how can we make it worth our while?

First, let's make the C code look a little more like the assembly we want:

int main_compare (int nbytes, char *pmem1, char *pmem2){
    while(nbytes-- > 0) {    
        if(*pmem1++ != *pmem2++) {
            return 0;
        }
    }
    return 1;
}
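
Since the rewrite walks the buffers forwards while the original walks them backwards, it can be reassuring to check on a desktop compiler that the two return the same result. The following is a minimal sketch of such a check; the names `compare_backward`, `compare_forward` and `count_disagreements` are made up here so that both versions can live in one file, and the 8-byte buffers are an arbitrary choice.

```c
#include <string.h>

/* The question's version, renamed (hypothetical name). */
int compare_backward(int nbytes, char *pmem1, char *pmem2) {
    for (nbytes--; nbytes >= 0; nbytes--) {
        if (*(pmem1 + nbytes) - *(pmem2 + nbytes) != 0) {
            return 0;
        }
    }
    return 1;
}

/* The rewritten version, renamed (hypothetical name). */
int compare_forward(int nbytes, char *pmem1, char *pmem2) {
    while (nbytes-- > 0) {
        if (*pmem1++ != *pmem2++) {
            return 0;
        }
    }
    return 1;
}

/* Returns how many test cases the two versions disagree on.
   For every length 0..8 it tries identical buffers, then buffers
   that differ in exactly one position. */
int count_disagreements(void) {
    char a[8], b[8];
    int bad = 0;
    memset(a, 'A', sizeof a);
    memset(b, 'A', sizeof b);
    for (int n = 0; n <= 8; n++) {
        if (compare_backward(n, a, b) != compare_forward(n, a, b)) {
            bad++;
        }
        for (int i = 0; i < 8; i++) {
            b[i] = 'B';                 /* introduce a single mismatch */
            if (compare_backward(n, a, b) != compare_forward(n, a, b)) {
                bad++;
            }
            b[i] = 'A';                 /* restore before the next case */
        }
    }
    return bad;
}
```

The same harness shape could in principle be pointed at an assembly routine linked into a test build, which is one way to address the "no idea how to check it" problem, though that depends on having a toolchain or simulator for the target.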

Now that the C code shows our intent more clearly, let's play compiler:

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

_loop:
    SUBS R0, R0, #1   ; Decrement nbytes and set flags based on the result
    BMI  _finished    ; If nbytes is now negative, it was 0, so we're done

    LDRB R3, [R1], #1 ; Load from the address in R1, then add 1 to R1
    LDRB R4, [R2], #1 ; ditto for R2
    CMP R3, R4        ; If they match...
    BEQ _loop         ; then continue round the loop

    MOV R0, #0        ; else give up and return zero
    BX LR

_finished:
    MOV R0, #1        ; Success!
    BX LR
ENDP

And that's nearly 25% fewer instructions (10 rather than 13)! Now if we pull in another instruction set feature - conditional execution - and relax the requirements slightly, without breaking C semantics, it gets smaller still:

byte_cmp_loop PROC
; assuming: r0 = nbytes, r1=pmem1, r2 = pmem2

_loop:
    SUBS R0, R0, #1 ; In C zero is false and any nonzero value is true, so
                    ; when R0 becomes -1 to trigger this branch, we can just
                    ; return that to indicate success
    IT MI           ; Make the following instruction conditional on 'minus'
    BXMI LR

    LDRB R3, [R1], #1
    LDRB R4, [R2], #1
    CMP R3, R4
    BEQ _loop

    MOVS R0, #0     ; Using MOVS rather than MOV to get a 16-bit encoding,
                    ; since updating the flags won't matter at this point
    BX LR
ENDP

assembling to a meagre 22 bytes, that's nearly 40% less code than we started with :D
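
The relaxed return value is the one behavioural difference a caller could notice, so here is a small C model of it. This is a sketch of how I read the final assembly, with `asm_model_relaxed` as a hypothetical name; it assumes nbytes is not INT_MIN, since the decrement would overflow in C.

```c
/* Hypothetical C model of the final assembly version.
   On success it returns the decremented counter rather than 1:
   that is -1 for any non-negative nbytes. */
int asm_model_relaxed(int nbytes, char *pmem1, char *pmem2) {
    for (;;) {
        nbytes = nbytes - 1;            /* SUBS R0, R0, #1          */
        if (nbytes < 0) {
            return nbytes;              /* IT MI; BXMI LR           */
        }
        if (*pmem1++ != *pmem2++) {     /* LDRB, LDRB, CMP, BEQ     */
            return 0;                   /* MOVS R0, #0; BX LR       */
        }
    }
}
```

A caller that writes `if (asm_model_relaxed(n, p, q))` sees the same truth value as before, but one that writes `== 1` would now treat a successful compare as a failure, so the shortcut is only safe when every caller tests for zero versus nonzero.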