Search code examples
cgccinline-assemblyvolatileriscv

Inline asm with input/output operands to update a pointer and update what it pointed to


I'm implementing 3*3 matrix and 3*1 matrix multiplication using RISC-V assembly language, with GNU C inline asm.

// description: matrix multiply with two-level for loop

#include<stdio.h>
int main()
{
        int f,i=0;
        int h[9]={0}, x[3]={0}, y[3]={0};
        FILE *input = fopen("../input/3.txt","r");
        for(i = 0; i<9; i++) fscanf(input, "%d", &h[i]);
        for(i = 0; i<3; i++) fscanf(input, "%d", &x[i]);
        for(i = 0; i<3; i++) fscanf(input, "%d", &y[i]);
        fclose(input);
        int *p_x = &x[0] ;
        int *p_h = &h[0] ;
        int *p_y = &y[0] ;

        for (i = 0 ; i < 3; i++)
        {

                p_x = &x[0] ;
        /*
         for (f = 0 ; f < 3; f++)
             *p_y += *p_h++ * *p_x++ ;
        */

                for (f = 0 ; f < 3; f++){
                   asm volatile (
                    "addi t0, zero, 2\n\t"
                    "bne t0, %[f], Multi\n\t"
                    "mul t1, %[sp_h], %[sp_x]\n\t"
                    "add %[sp_y], %[sp_y], t1\n\t"
                    "addi %[p_h], %[p_h], 4\n\t"
                    "addi %[p_x], %[p_x], 4\n\t"
                    "addi %[p_y], %[p_y], 4\n\t"
                    "beq zero, zero, Exit\n\t"
                    "Multi: \n\t"
                    "mul t1, %[sp_h], %[sp_x]\n\t"
                    "add %[sp_y], %[sp_y], t1\n\t"
                    "addi %[p_h], %[p_h], 4\n\t"
                    "addi %[p_x], %[p_x], 4\n\t"
                    "Exit: \n\t"
                    :[p_y] "+&r"(p_y),
                     [p_x] "+&r"(p_x),
                     [p_h] "+&r"(p_h),
                     [sp_y] "+&r"(*p_y)
                    :[sp_h] "r"(*p_h),
                     [sp_x] "r"(*p_x),
                     [f] "r"(f)
                    :"t0","t1"

                );
            printf("x value=%d, h value=%d, y value=%d, y address=%d\n", *p_x, *p_h, *p_y, p_y);
        }

        }

        p_y = &y[0];

        for(i = 0; i<3; i++)
                printf("%d \n", *p_y++);

        return(0) ;

}

I want to transfer this comment

for (f = 0 ; f < 3; f++) 
    *p_y += *p_h++ * *p_x++ ;
p_y++;

into asm volatile(...), but above code in asm I encounter this problem: my problem It seems that p_y address is added correct, and my multiplication is correct, but my value store into memory is wrong. It will store into memory too fast, and now my answer adds previous answer together. Above code answer is 5 28 69 Could anybody help me? I have edited the "+r" to "+&r" and add the clobbers to the code, but it doesn't work.

And by the way, i get right answer from this code:

for (f = 0 ; f < 3; f++)
                    asm volatile ("mul %[tmp], %[sp_h], %[sp_x]\n\t"
                                  "add %[sp_y], %[sp_y], %[tmp]\n\t"
                                  "addi %[p_x], %[p_x], 4\n\t"
                                  "addi %[p_h], %[p_h], 4\n\t"
                                  :[tmp] "=r"(tmp),
                                   [p_x] "+r"(p_x),
                                   [p_h] "+r"(p_h),
                                   [sp_y] "+r"(*p_y)
                                  :[sp_h] "r"(*p_h),
                                   [sp_x] "r"(*p_x)
                                );
            p_y++;

I expect

14 
32 
50

for the answer. And here is input data:

1 2 3 4 5 6 7 8 9
1 2 3
0 0 0

Solution

  • Your [sp_y] "+&r"(*p_y) input/output uses the original p_y to read the input, but it uses the p_y updated by the asm statement (in the f==2 case) to store the result. The compiler-generated asm from gcc -O1 -Wall (on Godbolt) shows this nicely:

    # top of inner loop
    .L4:
            lw      a1,0(s3)
            lw      a0,0(s2)
            lw      a3,0(s1)      # "+r"(*p_y) in/out operand in a3
            mv      a4,s1         # "+r"(p_y)  in a4
            mv      a5,s2
            mv      a2,s3
    # your asm starts here
            addi t0, zero, 2
            bne t0, s0, Multi
            mul t1, a1, a0
            add a3, a3, t1        # sum += stuff
            addi a2, a2, 4
            addi a5, a5, 4
            addi a4, a4, 4
            beq zero, zero, Exit
    Multi: 
            mul t1, a1, a0
            add a3, a3, t1       # sum += stuff in the other branch
            addi a2, a2, 4
            addi a5, a5, 4
    Exit: 
    
    # your asm ends here, compiler now has to move regs to C objects
    # (with more optimization, it might have just used s1 as [p_y] avoiding two mv)
            mv      s1,a4
            mv      s2,a5
            mv      s3,a2
            sw      a3,0(a4)    # store new sum using the new p_y
            lw      a2,0(a2)
            lw      a1,0(a5)
            addi    a0,s5,%lo(.LC3)
            call    printf
    

    GCC's docs don't mention what happens when one output modifies a C variable that's also part of lvalue expression for another output, at least not that I noticed. I'd assume the outputs are unsequenced wrt. each other, in which case it's undefined behaviour similar to x = ++x + ++x;, or at least unspecified exactly what happens.

    So don't do that. Don't conditionally increment p_y inside the inner loop, so don't make p_y an output (or input) to the inner-loop asm statement at all.

    I don't think there's any way to make that fully well-defined, although "=r"(p_y[f == 2 ? -1 : 0]) and a matching constraint like "0"(*p_y) would probably always use the updated p_y from the other output register. But then you're manually doing an offset in the C to balance what the asm does, totally defeating the purpose. So use a separate asm statement outside the inner loop to do the pointer increment in the outer loop.

    A separate int sum = *p_y; variable would be normal for this kind of thing, but you'd still have to assign *p_y = sum; before incrementing p_y, so you still couldn't do it inside the inner-loop asm statement. You don't want the compiler to sw / lw the sum inside the inner loop like you're encouraging it to do with *p_y.

    So it would look like

      int sum = *p_y;      // asm lw
      for (int f = 0 ; f < 3; f++) {
          asm("..." 
              : "+&r"(sum), ...  // not including p_y
              : [sp_h] "r"(*p_h),  [sp_x] "r"(*p_x)  // same as before
              : // no clobbers, you don't need to compare/branch inside the asm
          );
      }
      *p_y = sum;          // asm sw
      asm("addi %0, %0, 4" : "+r"(p_y));  // p_y++;
    

    Or write the whole outer-loop body (including the inner loop and the p_y++) in asm, including your own lw and sw, so you can manually put the addi %[p_y], %[p_y], 4 after an sw that uses %[p_y].

    (Don't forget a "memory" clobber for an asm statement with pointer inputs that reads/writes memory that isn't an "m" constraint.)