multithreading assembly process operating-system context-switching

Can anyone explain this assembly code from the stack's point of view?

Can anyone explain this ARM assembly code from the stack's point of view? Specifically, what does the stack look like in "reset_handler" right before calling "main", and in the "save_context" and "resume" parts? (I know what the code is doing, but I can't picture exactly how the stack looks or behaves while the code is running.)

/* asm.s */

  .global main, process, process_size
  .global reset_handler, context_switch, running

reset_handler:
   ldr r0, =process
   ldr r1, =process_size 
   ldr r2, [r1, #0] 
   add r0, r0, r2 
   mov sp, r0 
   bl main 
   
context_switch:
 save_context:
  stmfd sp!, {r0-r12, lr}
  ldr r0, =running 
  ldr r1, [r0, #0] 
  str sp, [r1, #4] 

 resume:
  ldr r0, =running
  ldr r1, [r0, #0] 
  ldr sp, [r1, #4] 
  ldmfd sp!, {r0-r12, lr} 
  mov pc, lr

/* cfile.c */

#define SIZE 2048 
typedef struct process
{
   struct process *next; 
   int *saved_stack;
   int running_stack[SIZE]; 
}PROC;

int process_size = sizeof(PROC);

PROC process, *running; 

main() 
{
  running = &process; 
  context_switch();
}

Solution

As background: the processor's registers pretty much define what it is doing, and together they are often referred to as a context. The most important is the program counter pc, which contains the memory address of the next instruction, but they all matter. So let's look at how to save a context:

save_context:
  stmfd sp!, {r0-r12, lr}
     -- that instruction saves the processor context to the stack
     -- it could be broken down as follows (treating sp as a pointer to 4-byte words):
     -- sp = sp - 14*4 bytes    4, because each register is 4 bytes, and there are 14 specified
     -- for (i=0; i < 13; i++)   sp[i] = r(i);
     -- sp[13] = lr      `lr` is special, it holds the return address: the instruction after the `bl` that called us.
  ldr r0, =running 
     -- put the address of the variable `running` into r0
  ldr r1, [r0, #0] 
     -- load r1 with the word stored at the address held in r0.  So r1 = running.
  str sp, [r1, #4] 
     -- store the stack pointer (sp) in the `saved_stack` field of running (offset 4, just after the 4-byte `next` pointer).
     -- so these three instructions perform:  running->saved_stack = sp;
     -- now we "fall through" to load, or `resume` a context.
 resume:
  ldr r0, =running
  ldr r1, [r0, #0] 
  ldr sp, [r1, #4]
      -- the inverse of the above, these three instructions effectively perform:
      -- sp = running->saved_stack 
  ldmfd sp!, {r0-r12, lr} 
      -- this is the complementary operation to the complicated save one above; but this time it is:
      -- for (i=0; i < 13; i++) r(i) = sp[i];
      -- lr = sp[13];
      -- sp = sp + 14*4 bytes;
  mov pc, lr
      -- this is a return instruction, where the program counter is loaded with the contents of the link register `lr`.
      -- the lr just restored is the one saved on entry, so this returns to main just after the call to context_switch
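
To make the stack picture concrete, the save and restore steps above can be modelled in plain C. This is only a sketch under assumptions: the `CPU` struct and the `NREGS` constant are hypothetical names invented here to stand in for the real registers, the stack fields use `uint32_t` so that each slot is exactly one 4-byte word, and the model ignores whatever frame `main` has already pushed.

```c
#include <stdint.h>

#define SIZE  2048
#define NREGS 14                 /* r0-r12 plus lr */

typedef struct process {
    struct process *next;
    uint32_t *saved_stack;       /* at offset 4 on a 32-bit ARM target */
    uint32_t running_stack[SIZE];
} PROC;

/* Hypothetical model of the registers that context_switch touches. */
typedef struct {
    uint32_t r[13];              /* r0-r12 */
    uint32_t lr;
    uint32_t *sp;                /* word pointer: sp - 14 moves 56 bytes */
} CPU;

/* Models: stmfd sp!, {r0-r12, lr}  then  running->saved_stack = sp */
void save_context(CPU *cpu, PROC *running)
{
    cpu->sp -= NREGS;                    /* the stack grows downward */
    for (int i = 0; i < 13; i++)
        cpu->sp[i] = cpu->r[i];          /* r0 lands at the lowest address */
    cpu->sp[13] = cpu->lr;               /* lr lands at the highest address */
    running->saved_stack = cpu->sp;
}

/* Models: sp = running->saved_stack  then  ldmfd sp!, {r0-r12, lr} */
void resume(CPU *cpu, const PROC *running)
{
    cpu->sp = running->saved_stack;
    for (int i = 0; i < 13; i++)
        cpu->r[i] = cpu->sp[i];
    cpu->lr = cpu->sp[13];
    cpu->sp += NREGS;                    /* sp is back where it started */
}
```

Starting with `sp` at `&process.running_stack[SIZE]` and calling `save_context` followed by `resume` leaves `saved_stack` 14 words below the top, with r0 at `saved_stack[0]` and lr at `saved_stack[13]`, and returns `sp` to its starting value. In the real program `main` has usually pushed its own frame first, so the 14 saved words sit below that frame rather than at the very top.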

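One fuzzy bit in the above: the pseudo-C mixes units. sp[i] indexes in 4-byte words (C would scale i by the size of a register), while sp is adjusted by 14*4, which is a byte count. If sp were a real C pointer to 32-bit words, the adjustments would be written sp -= 14 and sp += 14. Since the pseudo-C isn't real, it seems OK.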
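
As for the stack in reset_handler right before `bl main`: the handler sets sp to the address of `process` plus `process_size`. Because `running_stack` is the last member of the struct, that address is one word past the end of the array, so the stack is empty at that point and the first push will land in `running_stack[SIZE-1]`. A minimal sketch of that arithmetic in C, where `initial_sp` is a hypothetical helper name and not something from the original code:

```c
#include <stddef.h>

#define SIZE 2048

typedef struct process {
    struct process *next;
    int *saved_stack;
    int running_stack[SIZE];
} PROC;

/* Models reset_handler: sp = (address of process) + process_size */
int *initial_sp(PROC *p)
{
    size_t process_size = sizeof(PROC);        /* ldr r2, [r1, #0] */
    return (int *)((char *)p + process_size);  /* add r0, r0, r2 ; mov sp, r0 */
}
```

Here `initial_sp(&process)` equals `&process.running_stack[SIZE]`. The `bl main` instruction itself pushes nothing; it only puts the return address in lr. Anything on the stack after that point comes from the compiler-generated prologue of `main`, which typically saves lr and a few registers.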