AVR assembly - problem with rotation of register after comparison

I'm trying to write some code in assembly but can't get past a problem with using the ror/rol instructions after a cp or cpi instruction. I've been stuck on it for a while and can't find anything about it. I have some older code where I used cpi, then branched to another part of the program and executed some instructions, the last of which was a rotation, with no problem. I tried adding some instructions before the rotation to see if that changed anything, and it didn't help. My goal is to get the decimal values 254, 253, 251, etc. in r16, i.e. just rotate the content of the register. Instead I'm getting 254, 252, 248. If I don't use cp or cpi, the problem doesn't occur. I'm a newbie to asm programming, so sorry if it's a really dumb question. The code below is a simplification of what I need: I work with user input, so half of the code is just there to simulate that part of the program.

    ldi r16,0b11111111
    ldi r17,0b00000001
    ldi r18,0b00000001
    ldi r20,0b00000000
start:
    cp r18,r17
    breq loop
loop:
    inc r20
    rol r16
    rjmp start

Solution

Your goal is to get the following values:

255 = 0xFF = 11111111
254 = 0xFE = 11111110
253 = 0xFD = 11111101
251 = 0xFB = 11111011

But you use the rol instruction, which shifts the register left by one bit and moves the carry flag into bit 0 (the old bit 7 goes into the carry). In your loop the carry is always zero when rol executes, so you always shift in a zero. This results in the following values:

255 = 0xFF = 11111111
254 = 0xFE = 11111110
252 = 0xFC = 11111100
248 = 0xF8 = 11111000

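The "problem" is that the cp instruction also writes the carry flag. The operation r18 minus r17 (your cp r18, r17) doesn't produce a borrow, so cp clears the carry on every pass, just before the rol. Your code works without the cp because nothing clears the carry between two rol instructions: the 1 that rol shifts out of bit 7 is still in the carry and is shifted back into bit 0 by the next rol.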
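
To make this concrete, here is a hand trace of the first three passes of the loop from the question, with inc, breq and rjmp left out because they don't touch the carry. The values in the comments are worked out from the documented flag behaviour of cp and rol, not captured from a simulator or real hardware:

    ldi r16, 0b11111111
    ldi r17, 0b00000001
    ldi r18, 0b00000001

    cp  r18, r17    ; 1 - 1 = 0, no borrow: C = 0
    rol r16         ; r16 = 11111110 (254), old bit 7 -> C = 1
    cp  r18, r17    ; C is cleared again
    rol r16         ; r16 = 11111100 (252), C = 1
    cp  r18, r17    ; C is cleared again
    rol r16         ; r16 = 11111000 (248), C = 1

Each rol does put a 1 into the carry, but the following cp overwrites it before the next rol can shift it back in.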

One possible solution is to put a cpi directly before the rol:

cpi r16, 255
rol r16

Now the carry is set whenever the content of r16 is below 255 (cpi performs the operation r16 - 255 and sets the carry on a borrow). The rol then shifts that carry into bit 0, and you get the correct result. On the first pass r16 is still 255, so the carry is clear and a 0 is shifted in, which gives 254. A nice side effect is that the solution costs only one additional clock cycle per loop iteration, so it's a bit cheaper than a branch instruction or similar.
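
Put into the loop from the question, the fix could look like the sketch below. It assumes the cpi goes after the inc and immediately before the rol, so no other flag-changing instruction sits between them; the sequence in the comments is traced by hand, not measured:

    ldi r16, 0b11111111
    ldi r17, 0b00000001
    ldi r18, 0b00000001
    ldi r20, 0b00000000
start:
    cp  r18, r17    ; your original comparison, clears C here
    breq loop
loop:
    inc r20         ; inc does not change C
    cpi r16, 255    ; C = 1 if r16 < 255, C = 0 if r16 = 255
    rol r16         ; shift left, C goes into bit 0
    rjmp start

    ; r16 after each pass:
    ; 254, 253, 251, 247, 239, 223, 191, 127, 255, 254, ...

After 127 the register goes back to 255 for one pass (the 0 is shifted out of bit 7 and a 1 comes in), and then the pattern starts again at 254.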