x86, arm, cpu-architecture, breakpoints

What happens to the processor after a breakpoint is hit?


I was reading up about breakpoints from a few articles such as this: https://interrupt.memfault.com/blog/cortex-m-breakpoints

Most resources mention that the processor gets halted. What does a processor halt mean? The processor would still be getting clock input, right? If so, it should in principle keep fetching and executing the next instruction. However, that does not happen.

So, can anyone help me understand what happens to the processor when a breakpoint is hit?


Solution

  • Do not confuse clocking a processor (or any logic) with fetching and executing instructions. There is no one-to-one link between a clock cycle and a fetch or an execute. The whole processor is clocked, not just the state machine that does the fetch, decode, execute, and so on. The clock feeds the memory interfaces, many of the signals and registers are clocked, there is some form of processor bus that runs off the clock, the debug interface is clocked, etc.

    Other than some low-power modes that we will not get into, the clock to the processor is not gated. (Think of the gate around my yard: it can be open or closed, letting people through or not; likewise a gated clock can be blocked or passed.) Clock gating is not used for breakpoints or halting in general; those are just inputs to the state machine that runs the processor (google "state machine" or "finite state machine").

    You may already know from textbook processors (fetch, decode, execute, writeback, ...) that the processor may stall the pipeline. (Real implementations outside the classroom do not look exactly like the textbook; it is just a textbook understanding.) There are names for the various reasons to stall, but this is another example of the disconnect between fetch/execute and the processor's clock: the clock keeps going, but the processor does not actually fetch or execute. Usually this is temporary.

    Take an MCU like the cortex-m family described in your external reference. You normally execute instructions from flash, and you normally talk to peripherals. It is very common for the flash to run on a clock that is half the processor clock, or slower by some other factor. And on many parts, if you push the processor clock faster with a PLL, there may be rules that the peripheral clock has to be slower. There is also no reason to assume that a bus transaction to a peripheral, at the same clock rate or slower, completes in one clock; it does not.

    Many of the cortex-m cores fetch either 16 or 32 bits per fetch (bus) cycle. If the flash runs at, say, half the clock rate of the processor and the core fetches a halfword at a time, then the processor can fetch only one instruction every two clock cycles. It can thus execute no more than one instruction every two clock cycles, and is often slower than that. Likewise, if it takes, say, 8 clock cycles to read the status of the uart, then the execution state of the LDR reading that address stalls the processor for at least 8 clocks. Some prefetching may be in flight, as well as some decoding, but the execution stage is stalled and eventually the whole processor stalls.

    Halts and breakpoints are no different, except that the processor never leaves the "execution" state, or at least not on its own. As the debugger shows, there are signals that a human can drive through the debugger that kick the processor out of that execution state into other states (perhaps fetching from a new address, or just moving on to the next state).
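
    The stall described above can be sketched in a few lines of C. This is a toy model under stated assumptions, not how any particular core does it: FLASH_WAIT_CLOCKS, stall_clock, and run_clocks are names made up for this illustration, and one wait clock per fetch is an arbitrary choice.

```c
/* Hypothetical toy model: the flash needs FLASH_WAIT_CLOCKS extra
   processor clocks before a fetched halfword is valid. */
#define FLASH_WAIT_CLOCKS 1

enum stall_state { S_FETCH_START, S_FETCH_WAIT, S_EXECUTE };

struct stall_cpu
{
    enum stall_state state;
    unsigned int wait_left;   /* clocks left until the flash data is valid */
    unsigned int clocks;      /* total clocks seen */
    unsigned int executed;    /* instructions completed */
};

/* One processor clock. The clock always arrives; whether anything
   useful happens depends on the state. */
void stall_clock ( struct stall_cpu *c )
{
    c->clocks++;
    switch(c->state)
    {
        case S_FETCH_START:
            c->wait_left=FLASH_WAIT_CLOCKS;
            c->state=(c->wait_left)?S_FETCH_WAIT:S_EXECUTE;
            break;
        case S_FETCH_WAIT:
            /* stalled: still clocked, but waiting on the flash */
            c->wait_left--;
            if(c->wait_left==0) c->state=S_EXECUTE;
            break;
        case S_EXECUTE:
            c->executed++;
            c->state=S_FETCH_START;
            break;
    }
}

/* Run n clocks from a fresh start, return instructions completed. */
unsigned int run_clocks ( unsigned int n )
{
    struct stall_cpu c = { S_FETCH_START, 0, 0, 0 };
    while(n--) stall_clock(&c);
    return c.executed;
}
```

    In this model every instruction costs three clocks (start the fetch, wait, execute), so run_clocks(30) completes 10 instructions even though the clock ticked 30 times. A breakpoint is the same idea with a wait that never ends on its own.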

    So as an example let us make a simple cortex-m simulator that only knows a few instructions. It is not a parallel pipeline (and some of these cortex-ms have very few stages in their pipe, not even enough for all the textbook stages) but a serial execution, perhaps the thing you do in that college course before you move on to a parallel pipe. This could be optimized more, but it is intentionally broken into a number of states and does, to some extent, one thing per clock.

    Some processors implement general purpose registers such that each register is its own chunk of flip-flops and multiple registers can be accessed in one clock. For demonstration purposes mine is going to be a register file, a.k.a. sram, and single ported, so that only one thing can happen at a time: one read or one write. So if I want to do an add r0,r1,r2, it takes a whole clock to get the value of r2 from the register file, a whole separate clock to get r1, and a whole separate clock to write r0 (after adding).

    I am going to cheat a little here and there. I could go through the states it takes to do a reset, which for a cortex-m involves at a minimum reading the word at address 0x00000000 and the word at address 0x00000004. (Note that some folks think instructions do this. No, it is just logic. Instructions are a concept that the logic operates on, just like the words on this page mean something to us but are built out of individual letters of the alphabet and displayed using many pixels.) These reads of memory are likely separate bus cycles just to get the stack pointer init value and the reset exception handler address. So I cheated there. I also made my memory 16 bits wide, not 32 nor 64, etc., which makes the code a bit easier to read.
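
    For the curious, here is roughly what not cheating on reset might look like: a rough sketch that spends one clock per 16-bit read of the vector table. The state names, struct reset_cpu, and reset_clocks_needed are invented for this illustration, and a real core's reset sequence is certainly more involved.

```c
/* Hypothetical sketch: reset handled as states, one 16-bit read per clock. */
enum reset_state { R_SP_LO, R_SP_HI, R_PC_LO, R_PC_HI, R_DONE };

struct reset_cpu
{
    enum reset_state state;
    unsigned int sp;
    unsigned int pc;
};

/* One clock of the reset sequence; mem is 16 bits wide. */
void reset_clock ( struct reset_cpu *c, const unsigned short *mem )
{
    switch(c->state)
    {
        case R_SP_LO:
            c->sp=mem[0];
            c->state=R_SP_HI;
            break;
        case R_SP_HI:
            c->sp|=((unsigned int)mem[1])<<16;
            c->state=R_PC_LO;
            break;
        case R_PC_LO:
            c->pc=mem[2];
            c->state=R_PC_HI;
            break;
        case R_PC_HI:
            c->pc|=((unsigned int)mem[3])<<16;
            c->pc&=~1u; /* drop the thumb bit, as the simulator's reset() does */
            c->state=R_DONE;
            break;
        case R_DONE:
            /* a real core would move on to fetching here */
            break;
    }
}

/* Clock the reset sequence to completion, return the clocks used. */
unsigned int reset_clocks_needed ( struct reset_cpu *c, const unsigned short *mem )
{
    unsigned int n=0;
    c->state=R_SP_LO;
    while(c->state!=R_DONE)
    {
        reset_clock(c,mem);
        n++;
    }
    return n;
}
```

    With the vector table used in this answer, that is four clocks spent before the first instruction fetch, with no instruction involved in any of them.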

    My program under test is

    .thumb
    .cpu cortex-m0
    
    .word 0x20001000
    .word reset
    .thumb_func
    reset:
        add r1,#1
        add r2,#2
        add r1,r1,r2
        add r1,r1,r1
        add r1,#3
        add r1,#4
        add r1,#5
        add r1,#6
        bkpt
        add r1,#7
        add r1,#8
        add r1,#9
        add r1,#10
        b .
    

    And here is my state machine based processor in C

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    unsigned short mem[256];
    unsigned int reg[16];
    unsigned int pc;
    unsigned int pc_next;
    unsigned int state;
    unsigned int next_state;
    unsigned int alu_a;
    unsigned int alu_b;
    unsigned int alu_out;
    unsigned int rd;
    unsigned int rn;
    unsigned int rm;
    unsigned int imm;
    unsigned short inst;
    
    unsigned int print_breakpoint_flag;
    
    enum
    {
        NONE,
        FETCH,
        DECODE,
        ADD_IMM,
        ADD_REG_A,
        ADD_REG_B,
        ADD_EXECUTE,
        ADD_WRITEBACK,
        BREAKPOINT
    };
    
    void reset ( void )
    {
        //normally this is assumed to be random garbage not zeros;
        memset(reg,0,sizeof(reg));
        //this would normally be done in the state machine as well with
        //some number of clock cycles
        reg[13]=mem[1];
        reg[13]<<=16;
        reg[13]|=mem[0];
        pc_next=mem[3];
        pc_next<<=16;
        pc_next|=mem[2];
        if((pc_next&1)==0)
        {
            printf("thumb only\n");
            exit(1);
        }
        pc_next&=(~1);
        printf("RESET 0x%08X 0x%08X\n",reg[13],pc_next);
        next_state=FETCH;
    }
    
    void one_clock ( void )
    {
        state=next_state;
        next_state=NONE;
        switch(state)
        {
            case FETCH:
            {
                printf("FETCH 0x%08X\n",pc_next);
                pc=pc_next;
                pc_next=pc+2;
                reg[15]=pc+4;
                inst=mem[pc>>1];
                next_state=DECODE;
                print_breakpoint_flag=1;
                break;
            }
            case DECODE:
            {
                printf("DECODE 0x%04X\n",inst);
                if((inst&0xF800)==0x3000)
                {
                    //add immediate immed8
                    rd=(inst>>8)&0x7;
                    rn=rd;
                    imm=(inst>>0)&0xFF;
                    next_state=ADD_IMM;
                }
                if((inst&0xFE00)==0x1800)
                {
                    rd=(inst>>0)&7;
                    rn=(inst>>3)&7;
                    rm=(inst>>6)&7;
                    next_state=ADD_REG_A;
                    break;
                }
                if((inst&0xFF00)==0xBE00)
                {
                    imm=(inst>>0)&0xFF;
                    next_state=BREAKPOINT;
                }
                break;
            }
            case ADD_IMM:
            {
                printf("ADD_IMM r%u #0x%X\n",rn,imm);
                alu_a=reg[rn];
                alu_b=imm;
                next_state=ADD_EXECUTE;
                break;
            }
            case ADD_EXECUTE:
            {
                printf("ADD_EXECUTE 0x%08X 0x%08X\n",alu_a,alu_b);
                alu_out=alu_a+alu_b;
                //not doing flags
                next_state=ADD_WRITEBACK;
                break;
            }
            case ADD_WRITEBACK:
            {
                printf("ADD_WRITEBACK r%u 0x%08X\n",rd,alu_out);
                reg[rd]=alu_out;
                next_state=FETCH;
                break;
            }
            case ADD_REG_A:
            {
                printf("ADD_REG_A r%u\n",rn);
                alu_a=reg[rn];
                next_state=ADD_REG_B;
                break;
            }
            case ADD_REG_B:
            {
                printf("ADD_REG_B r%u\n",rm);
                alu_b=reg[rm];
                next_state=ADD_EXECUTE;
                break;
            }
            case BREAKPOINT:
            {
                if(print_breakpoint_flag)
                {
                    printf("BREAKPOINT\n");
                    print_breakpoint_flag=0;
                }
                //some debugger hardware would be implemented to 
                //kick the state machine out of this state.
                next_state=BREAKPOINT;
                break;
            }
            default:
            {
                exit(0);
            }
        }
    
    
    }
    
    int main ( void )
    {
    
    //00000000 <reset-0x8>:
        mem[1]=0x2000; mem[0]=0x1000;//0:   20001000    .word   0x20001000
        mem[3]=0x0000; mem[2]=0x0009;//4:   00000009    .word   0x00000009
    //00000008 <reset>:
        mem[ 4]=0x3101; //   8:   3101        adds    r1, #1
        mem[ 5]=0x3202; //   a:   3202        adds    r2, #2
        mem[ 6]=0x1889; //   c:   1889        adds    r1, r1, r2
        mem[ 7]=0x1849; //   e:   1849        adds    r1, r1, r1
        mem[ 8]=0x3103; //  10:   3103        adds    r1, #3
        mem[ 9]=0x3104; //  12:   3104        adds    r1, #4
        mem[10]=0x3105; //  14:   3105        adds    r1, #5
        mem[11]=0x3106; //  16:   3106        adds    r1, #6
        mem[12]=0xbe00; //  18:   be00        bkpt    0x0000
        mem[13]=0x3107; //  1a:   3107        adds    r1, #7
        mem[14]=0x3108; //  1c:   3108        adds    r1, #8
        mem[15]=0x3109; //  1e:   3109        adds    r1, #9
        mem[16]=0x310a; //  20:   310a        adds    r1, #10
        mem[17]=0xe7fe; //  22:   e7fe        b.n 22 <reset+0x1a>
    
        reset();
        while(1)
        {
            one_clock();
        }
        return(0);
    }
    

    The processor clock runs forever, no matter what

        while(1)
        {
            one_clock();
        }
    

    I did not deal with the flags that an add sets, and I am not doing any conditional execution in the few instructions I supported. This is obviously not a complete processor, just the minimum brute-force code to handle a few instructions.

    The output of the program looks like this

    RESET 0x20001000 0x00000008
    FETCH 0x00000008
    DECODE 0x3101
    ADD_IMM r1 #0x1
    ADD_EXECUTE 0x00000000 0x00000001
    ADD_WRITEBACK r1 0x00000001
    FETCH 0x0000000A
    DECODE 0x3202
    ADD_IMM r2 #0x2
    ADD_EXECUTE 0x00000000 0x00000002
    ADD_WRITEBACK r2 0x00000002
    FETCH 0x0000000C
    DECODE 0x1889
    ADD_REG_A r1
    ADD_REG_B r2
    ADD_EXECUTE 0x00000001 0x00000002
    ADD_WRITEBACK r1 0x00000003
    FETCH 0x0000000E
    DECODE 0x1849
    ADD_REG_A r1
    ADD_REG_B r1
    ADD_EXECUTE 0x00000003 0x00000003
    ADD_WRITEBACK r1 0x00000006
    FETCH 0x00000010
    DECODE 0x3103
    ADD_IMM r1 #0x3
    ADD_EXECUTE 0x00000006 0x00000003
    ADD_WRITEBACK r1 0x00000009
    FETCH 0x00000012
    DECODE 0x3104
    ADD_IMM r1 #0x4
    ADD_EXECUTE 0x00000009 0x00000004
    ADD_WRITEBACK r1 0x0000000D
    FETCH 0x00000014
    DECODE 0x3105
    ADD_IMM r1 #0x5
    ADD_EXECUTE 0x0000000D 0x00000005
    ADD_WRITEBACK r1 0x00000012
    FETCH 0x00000016
    DECODE 0x3106
    ADD_IMM r1 #0x6
    ADD_EXECUTE 0x00000012 0x00000006
    ADD_WRITEBACK r1 0x00000018
    FETCH 0x00000018
    DECODE 0xBE00
    BREAKPOINT
    

    Each line is one clock, one state in the state machine, except at the end: instead of printing BREAKPOINT infinitely I only print it once.

    And hopefully this demonstrates the answer to the question.

        case BREAKPOINT:
        {
            next_state=BREAKPOINT;
            break;
        }
    

    The processor is being clocked and the clock does not stop, but the state machine is stuck in the breakpoint state forever.

    In a real processor there would be a way out: some other signals, also clocked by the processor clock, that live not in this state machine but in other state machines. (Remember that unlike a C program, things in logic happen in parallel; to demonstrate that, one_clock would have to step multiple state machines or other individual signals.)

    With a little googling and some keyboard-handling code borrowed from stackoverflow, we can add a way out:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <termios.h>
    
    unsigned short mem[256];
    unsigned int reg[16];
    unsigned int pc;
    unsigned int pc_next;
    unsigned int state;
    unsigned int next_state;
    unsigned int alu_a;
    unsigned int alu_b;
    unsigned int alu_out;
    unsigned int rd;
    unsigned int rn;
    unsigned int rm;
    unsigned int imm;
    unsigned short inst;
    
    unsigned int print_breakpoint_flag;
    unsigned int exit_breakpoint;
    
    enum
    {
        NONE,
        FETCH,
        DECODE,
        ADD_IMM,
        ADD_REG_A,
        ADD_REG_B,
        ADD_EXECUTE,
        ADD_WRITEBACK,
        BREAKPOINT
    };
    
    void reset ( void )
    {
        //normally this is assumed to be random garbage not zeros;
        memset(reg,0,sizeof(reg));
        //this would normally be done in the state machine as well with
        //some number of clock cycles
        reg[13]=mem[1];
        reg[13]<<=16;
        reg[13]|=mem[0];
        pc_next=mem[3];
        pc_next<<=16;
        pc_next|=mem[2];
        if((pc_next&1)==0)
        {
            printf("thumb only\r\n");
            exit(1);
        }
        pc_next&=(~1);
        printf("RESET 0x%08X 0x%08X\r\n",reg[13],pc_next);
        next_state=FETCH;
    }
    
    void one_clock ( void )
    {
        state=next_state;
        next_state=NONE;
        switch(state)
        {
            case FETCH:
            {
                printf("FETCH 0x%08X\r\n",pc_next);
                pc=pc_next;
                pc_next=pc+2;
                reg[15]=pc+4;
                inst=mem[pc>>1];
                next_state=DECODE;
                print_breakpoint_flag=1;
                break;
            }
            case DECODE:
            {
                printf("DECODE 0x%04X\r\n",inst);
                if((inst&0xF800)==0x3000)
                {
                    //add immediate immed8
                    rd=(inst>>8)&0x7;
                    rn=rd;
                    imm=(inst>>0)&0xFF;
                    next_state=ADD_IMM;
                }
                if((inst&0xFE00)==0x1800)
                {
                    rd=(inst>>0)&7;
                    rn=(inst>>3)&7;
                    rm=(inst>>6)&7;
                    next_state=ADD_REG_A;
                    break;
                }
                if((inst&0xFF00)==0xBE00)
                {
                    imm=(inst>>0)&0xFF;
                    next_state=BREAKPOINT;
                }
                break;
            }
            case ADD_IMM:
            {
                printf("ADD_IMM r%u #0x%X\r\n",rn,imm);
                alu_a=reg[rn];
                alu_b=imm;
                next_state=ADD_EXECUTE;
                break;
            }
            case ADD_EXECUTE:
            {
                printf("ADD_EXECUTE 0x%08X 0x%08X\r\n",alu_a,alu_b);
                alu_out=alu_a+alu_b;
                //not doing flags
                next_state=ADD_WRITEBACK;
                break;
            }
            case ADD_WRITEBACK:
            {
                printf("ADD_WRITEBACK r%u 0x%08X\r\n",rd,alu_out);
                reg[rd]=alu_out;
                next_state=FETCH;
                break;
            }
            case ADD_REG_A:
            {
                printf("ADD_REG_A r%u\r\n",rn);
                alu_a=reg[rn];
                next_state=ADD_REG_B;
                break;
            }
            case ADD_REG_B:
            {
                printf("ADD_REG_B r%u\r\n",rm);
                alu_b=reg[rm];
                next_state=ADD_EXECUTE;
                break;
            }
            case BREAKPOINT:
            {
                if(print_breakpoint_flag)
                {
                    printf("BREAKPOINT\r\n");
                    print_breakpoint_flag=0;
                }
                //some debugger hardware would be implemented to 
                //kick the state machine out of this state.
                next_state=BREAKPOINT;
                if(exit_breakpoint)
                {
                    exit_breakpoint=0;
                    next_state=FETCH;
                }
                break;
            }
            default:
            {
                exit(0);
            }
        }
    
    
    }
    
    
    
    struct termios orig_termios;
    
    void reset_terminal_mode()
    {
        tcsetattr(0, TCSANOW, &orig_termios);
    }
    
    void set_conio_terminal_mode()
    {
        struct termios new_termios;
    
        /* take two copies - one for now, one for later */
        tcgetattr(0, &orig_termios);
        memcpy(&new_termios, &orig_termios, sizeof(new_termios));
    
        /* register cleanup handler, and set the new terminal mode */
        atexit(reset_terminal_mode);
        cfmakeraw(&new_termios);
        tcsetattr(0, TCSANOW, &new_termios);
    }
    
    int kbhit()
    {
        struct timeval tv = { 0L, 0L };
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(0, &fds);
        return select(1, &fds, NULL, NULL, &tv) > 0;
    }
    
    int getch()
    {
        int r;
        unsigned char c;
        if ((r = read(0, &c, sizeof(c))) < 0) {
            return r;
        } else {
            return c;
        }
    }
    
    int main ( void )
    {
    
    //00000000 <reset-0x8>:
        mem[1]=0x2000; mem[0]=0x1000;//0:   20001000    .word   0x20001000
        mem[3]=0x0000; mem[2]=0x0009;//4:   00000009    .word   0x00000009
    //00000008 <reset>:
        mem[ 4]=0x3101; //   8:   3101        adds    r1, #1
        mem[ 5]=0x3202; //   a:   3202        adds    r2, #2
        mem[ 6]=0x1889; //   c:   1889        adds    r1, r1, r2
        mem[ 7]=0x1849; //   e:   1849        adds    r1, r1, r1
        mem[ 8]=0x3103; //  10:   3103        adds    r1, #3
        mem[ 9]=0x3104; //  12:   3104        adds    r1, #4
        mem[10]=0x3105; //  14:   3105        adds    r1, #5
        mem[11]=0x3106; //  16:   3106        adds    r1, #6
        mem[12]=0xbe00; //  18:   be00        bkpt    0x0000
        mem[13]=0x3107; //  1a:   3107        adds    r1, #7
        mem[14]=0x3108; //  1c:   3108        adds    r1, #8
        mem[15]=0x3109; //  1e:   3109        adds    r1, #9
        mem[16]=0x310a; //  20:   310a        adds    r1, #10
        mem[17]=0xe7fe; //  22:   e7fe        b.n 22 <reset+0x1a>
    
        set_conio_terminal_mode();
        exit_breakpoint=0;
        reset();
        while(1)
        {
    
            if(kbhit())
            {
                getch();
                exit_breakpoint=1;
            }
    
            one_clock();
        }
        return(0);
    }
    

    At least on my Linux system, I can now run it: it "halts" at the breakpoint until I press a key on the keyboard, and then it continues.

        case BREAKPOINT:
        {
            if(print_breakpoint_flag)
            {
                printf("BREAKPOINT\r\n");
                print_breakpoint_flag=0;
            }
            //some debugger hardware would be implemented to 
            //kick the state machine out of this state.
            next_state=BREAKPOINT;
            if(exit_breakpoint)
            {
                exit_breakpoint=0;
                next_state=FETCH;
            }
            break;
        }
    
    BREAKPOINT
    FETCH 0x0000001A
    DECODE 0x3107
    ADD_IMM r1 #0x7
    ADD_EXECUTE 0x00000018 0x00000007
    ADD_WRITEBACK r1 0x0000001F
    FETCH 0x0000001C
    DECODE 0x3108
    ADD_IMM r1 #0x8
    ADD_EXECUTE 0x0000001F 0x00000008
    ADD_WRITEBACK r1 0x00000027
    FETCH 0x0000001E
    DECODE 0x3109
    ADD_IMM r1 #0x9
    ADD_EXECUTE 0x00000027 0x00000009
    ADD_WRITEBACK r1 0x00000030
    FETCH 0x00000020
    DECODE 0x310A
    ADD_IMM r1 #0xA
    ADD_EXECUTE 0x00000030 0x0000000A
    ADD_WRITEBACK r1 0x0000003A
    FETCH 0x00000022
    DECODE 0xE7FE
    

    And we see the rest of the execution up to the branch to self, which I did not implement, so it hits the NONE state and exits the program.

    In the case of the arm cortex-m family, the arm documentation says:

    Breakpoint causes a HardFault exception or a debug halt to occur depending on the presence and configuration of the debug support.

    I have chosen "debug halt" here. If it were implemented as a HardFault instead, the processor would not stop executing; it would do all the stack work the processor does to save state before handling an exception, read the HardFault exception handler address, and then fetch instructions from there. All of this is to some extent documented in the arm documentation.
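
    To make the two outcomes concrete, here is a rough sketch of how the decode of a bkpt could branch either way. The debug_halt_enabled flag is a made-up stand-in for whatever debug configuration the core has, the state names are invented, and I cheat again by stacking all eight words (r0-r3, r12, lr, pc, xPSR) in a single clock. The HardFault vector sits at address 0x0000000C (exception number 3).

```c
/* Hypothetical sketch: what the state machine does once a bkpt is decoded. */
enum bkpt_state { B_BREAKPOINT_HALT, B_EXC_STACK, B_EXC_VECTOR, B_FETCH };

#define HARDFAULT_VECTOR_ADDR 0x0000000C /* exception 3, 4 bytes per vector */

struct bkpt_cpu
{
    enum bkpt_state state;
    unsigned int pc_next;
    unsigned int stacked_words;
};

/* Called from decode; debug_halt_enabled stands in for the core's
   debug configuration (an assumption of this sketch). */
void bkpt_decoded ( struct bkpt_cpu *c, int debug_halt_enabled )
{
    c->state=debug_halt_enabled?B_BREAKPOINT_HALT:B_EXC_STACK;
}

/* One clock; mem is the 16-bit wide memory used in this answer. */
void bkpt_clock ( struct bkpt_cpu *c, const unsigned short *mem )
{
    switch(c->state)
    {
        case B_BREAKPOINT_HALT:
            /* clocked, but going nowhere until a debugger steps in */
            break;
        case B_EXC_STACK:
            /* cheating: real hardware needs a bus cycle per word pushed */
            c->stacked_words=8;
            c->state=B_EXC_VECTOR;
            break;
        case B_EXC_VECTOR:
            c->pc_next=mem[HARDFAULT_VECTOR_ADDR>>1];
            c->pc_next|=((unsigned int)mem[(HARDFAULT_VECTOR_ADDR>>1)+1])<<16;
            c->pc_next&=~1u; /* drop the thumb bit */
            c->state=B_FETCH;
            break;
        case B_FETCH:
            /* normal execution resumes at the handler */
            break;
    }
}
```

    In the HardFault case the processor never "stops" at all; it simply ends up fetching from a different address a few clocks later.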

    The cortex-m, and arms in general, have WFI as another example, which is wait for interrupt:

    Wait For Interrupt is a hint instruction that suspends execution until one of a number of events occurs.

    And then a page of the documentation goes through the possible ways of getting out of a WFI (if it is actually implemented; on some cores a WFI is just a nop and does not wait).

    A halt, if a processor has one, would have fewer ways out than a WFI, but would be similar to a breakpoint as far as how to get out of it (a reset, or some debugger interaction to change the state of the state machine).

    Not all processors have a halt. What I have shown so far is a breakpoint instruction. In a ram-based system, where your program is in ram, an instruction would be replaced by a breakpoint, as your external reference states or implies.

    add r1,#4
    add r1,#5
    add r1,#6
    add r3,r3,r1 
    add r1,#7
    add r1,#8
    add r1,#9
    add r1,#10
    

    You might go into some gui debug tool, select the add r3,r3,r1, and click some breakpoint thing. That may literally cause the gui software to write a 0xbe00 instruction where that add was, with the software remembering that the add was there. When you execute, the breakpoint happens, and some debug logic tells the gui (there are more wires and signals in the processor so that the debugger can detect the execution of the breakpoint). When you press some continue button, the gui may replace the breakpoint instruction in that memory location with the real add instruction and then change the processor state to execute that address again. That would be the kind of debugger that clears the breakpoint once you have stopped on it. Some may keep the breakpoint; in that case they would likely replace the breakpoint with the real instruction, single step, put the breakpoint back, and then continue execution.
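
    The debugger side of that patching could look something like the following. It is only a sketch against the 16-bit mem[] layout used in this answer; struct sw_breakpoint, bp_set, and bp_clear are invented names, not any real debugger's API.

```c
#define BKPT_OPCODE 0xBE00 /* thumb bkpt 0, as in the program above */

/* Hypothetical bookkeeping a debugger keeps per software breakpoint. */
struct sw_breakpoint
{
    unsigned int addr;     /* byte address of the patched instruction */
    unsigned short saved;  /* the instruction that was really there */
    int active;
};

/* Remember the original instruction, then overwrite it with bkpt. */
void bp_set ( struct sw_breakpoint *bp, unsigned short *mem, unsigned int addr )
{
    bp->addr=addr;
    bp->saved=mem[addr>>1];
    mem[addr>>1]=BKPT_OPCODE;
    bp->active=1;
}

/* Put the original instruction back. Returns the address the
   processor should be told to execute again. */
unsigned int bp_clear ( struct sw_breakpoint *bp, unsigned short *mem )
{
    if(bp->active)
    {
        mem[bp->addr>>1]=bp->saved;
        bp->active=0;
    }
    return bp->addr;
}
```

    A debugger that keeps the breakpoint armed would call bp_clear, single step one instruction, and then call bp_set on the same address again before continuing.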

    Single stepping with a debugger is just more signals into the processor's execution state machine from some other debugger state machine, used to put the processor into a halted state: state=next_state; if debugger_state==step then state=WAIT_FOR_STEP. That wait-for-step state would wait for some debugger state to change, or for some other signal. (Think of signals and registers in logic as variables in C.)
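
    Spelled out as C, that override might look like this. It is a minimal sketch with a two-state "processor" (fetch, execute); step_mode and step_request are hypothetical signals that a debugger state machine would drive, not anything from a real core.

```c
enum step_state { T_FETCH, T_EXECUTE, T_WAIT_FOR_STEP };

struct step_cpu
{
    enum step_state state;
    enum step_state next_state;
    unsigned int executed;  /* instructions completed */
    int step_mode;          /* debugger signal: single-step mode is on */
    int step_request;       /* debugger signal: run one more instruction */
};

/* One clock. The debugger override is applied right after
   state=next_state, at an instruction boundary. */
void step_clock ( struct step_cpu *c )
{
    c->state=c->next_state;
    if(c->step_mode && c->state==T_FETCH && !c->step_request)
    {
        c->state=T_WAIT_FOR_STEP;
    }
    switch(c->state)
    {
        case T_FETCH:
            c->step_request=0; /* the request is consumed by this fetch */
            c->next_state=T_EXECUTE;
            break;
        case T_EXECUTE:
            c->executed++;
            c->next_state=T_FETCH;
            break;
        case T_WAIT_FOR_STEP:
            /* try to fetch again next clock; the override above holds
               us here until the debugger raises step_request */
            c->next_state=T_FETCH;
            break;
    }
}
```

    With step_mode set, the clock can tick any number of times and executed does not move; each time the debugger sets step_request, exactly one instruction gets through and the processor parks itself again.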

    The other example, halting on an address, would live in our FETCH state, for example:

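        case FETCH:
        {
            //hardware_monitor_address and HALT are not in the simulator
            //above; they stand in for a comparator register written by
            //the debugger and a new halted state
            if(pc_next==hardware_monitor_address)
            {
                next_state=HALT;
                break;
            }
            printf("FETCH 0x%08X\r\n",pc_next);
            pc=pc_next;
            pc_next=pc+2;
            reg[15]=pc+4;
            inst=mem[pc>>1];
            next_state=DECODE;
            print_breakpoint_flag=1;
            break;
        }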
    

    and that would put us in a halt state similar to the breakpoint and we would need signals from the debugger to kick us out of that state.
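
    A core with more than one such address monitor could be sketched as a small bank of comparators checked on every fetch. The count of 4 and the names hw_comparator, comparator_set, and comparator_hit are assumptions made for this sketch only; real parts have their own fixed number of comparators and their own registers for programming them.

```c
#define NUM_COMPARATORS 4 /* arbitrary for this sketch */

struct hw_comparator
{
    unsigned int addr;  /* address to halt on */
    int enabled;
};

/* Debugger side: claim a free comparator. Returns its index, or -1
   if all of them are already in use. */
int comparator_set ( struct hw_comparator *cmp, unsigned int addr )
{
    int i;
    for(i=0;i<NUM_COMPARATORS;i++)
    {
        if(!cmp[i].enabled)
        {
            cmp[i].addr=addr;
            cmp[i].enabled=1;
            return i;
        }
    }
    return -1;
}

/* Processor side: the FETCH state would go to HALT when this
   returns 1 for pc_next. */
int comparator_hit ( const struct hw_comparator *cmp, unsigned int pc_next )
{
    int i;
    for(i=0;i<NUM_COMPARATORS;i++)
    {
        if(cmp[i].enabled && cmp[i].addr==pc_next) return 1;
    }
    return 0;
}
```

    Unlike the patched-in bkpt, nothing in program memory changes here; the price in this model is that the debugger runs out of comparators after NUM_COMPARATORS addresses.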