C embedded assembly error: ‘asm’ operand has impossible constraints


When I embed assembly in C, I get the following error when compiling this code from the shell on Ubuntu Linux 14.04.

    IFR_temp_measure.cpp: In function ‘void BlockTempClc(char*, char*, int, int, char, int, int, int, int*, int, int*, int)’:
    IFR_temp_measure.cpp:1843:6: error: ‘asm’ operand has impossible constraints
         );
         ^
    make: *** [IFR_temp_measure.o] Error 1

The error location, lines 1842 and 1843, corresponds to this code:

    :"cc", "memory","q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q10", "q11", "q12", "q13", "q14", "q15","r0", "r1", "r3", "r4", "r5","r6","r8", "r9", "r10", "r12"
            );

I have tried to solve this problem, but few references are available online. I found the question "Gcc inline assembly what does "'asm' operand has impossible constraints" mean?" and http://www.ethernut.de/en/documents/arm-inline-asm.html, but neither helped. My code is as follows:

    void BlockTempClc(char* src1,char* src2,int StrideDist,int height,char temp_comp1,int numofiterations,int temp_comp2,int temp_comp3,int *dstData,int width,int *dstSum,int step)
    {

            volatile char array1[16] = {0,0,0,0,0,0,0,0,
                                       0,0,0,0,0,0,0,0};
            volatile char array2[16] = {0,0,1,0,2,0,3,0,
                                       4,0,5,0,6,0,7,0};
            asm volatile(   
            "mov        r0, %0; " //image[0]    
            "mov        r1, %1; "  //image[1] 
            "mov        r12,%11; " //m
            "mov        r3, %4; " //n
            "mov        r4, %2; " //store data
            "mov        r8, %12; " //step down for loading next line of image
            "mov        r5, %6; " //numofiterations
            "mov        r6, %3; " //out

            "mov.8 r9,%5;"//isp_temp_comp
            "mov.8 r10,%7;"//led_temp_comp
            "mov.8 r11,%8;"//fac_temp_comp


            "vdup.8 d20,r9;"//copy arm register value isp_temp_comp to neon  register
            "VMOV.S16 q9, d20; " //isp_temp_comp transfer to signed short type

            "VLD1.8     {d16,d17}, [%9];"//q8  array1 sum
            "VLD1.8     {d6,d7}, [%10];"//q3  array2

            "VMOV.S16   q0, #256; "
            "VMOV.S16   q1, #2730; " //Assign immediate number 2730 to each 16 bits of d1       

            ".loop:;"           

            "vdup.8 d21,r10;"//copy arm register value led_temp_comp to neon  register 
            "vdup.8 d22,r11;"//copy arm register value fac_temp_comp to neon  register 

            "VLD1.8    d14, [r1],r8; "    // q7  *(image[1] + tmp + n)  Load: Load Picture Pixels   r6:move step  ?
            "VLD1.8    d15, [r0],r8 "    // *(image[0] + tmp + n)  Load: Load Picture Pixels            

            "PLD        [r1]; " //Preload: one line in cache
            "PLD        [r0]; "  //?

            "VMOV.S16  q5, d14; " //q5    8*16  transfer to signed short type:*(image[1] + tmp + n) 
            "VMOV.S16  q6, d15; " //q6    8*16  transfer to signed short type : *(image[0] + tmp + n) 

            "VADD.S16  q12,q6, q9;"//*(image[0] + tmp + n) + isp_temp_comp              
            "VMOV.S16  q6, d21; " //led_temp_comp
            "VADD.S16  q13,q12, q6;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp
            "VMOV.S16  q6, d22; " //fac_temp_comp
            "VADD.S16  q14,q13, q6;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp
            "VSUB.S16  q15,q14, q1;"//*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp-2730
            "VMLA.S16   q15, q5, q0;"//img_temp[m][n]=*(image[0] + tmp + n) + isp_temp_comp+ + led_temp_comp+ fac_temp_comp-2730+*(image[1] + tmp + n) *256 


            "VADD.S16  q2,q15, q8;"//sum                
            "VMOV.S16    q8, q2; " //q8


            "vdup.8 d20,r3;"//n 
            "vdup.8 d21,r12;"//m

            "VMOV.S16  q11, d20; " //n
            "VMOV.S16  q10, d21; " //m

            "VADD.S16  q4,q3, q11;"//(n,n+1,n+2,n+3,n+4,n+5,n+6,n+7)
            "VADD.S16  q7,q3, q10;"//(m,m+1,m+2,m+3,m+4,m+5,m+6,m+7)  q7


            "VST1.16     {d30[0]}, [r4]!; "//restore img_temp[m][n] to pointer data
            "VST1.16     {d14[0]}, [r4]!; "//restore m
            "VST1.16     {d8[0]}, [r4]!; "  //restore n

            "VST1.16     {d30[1]}, [r4]!; "     
            "VST1.16     {d14[1]}, [r4]!; "
            "VST1.16     {d8[1]}, [r4]!; "

            "VST1.16     {d30[2]}, [r4]!; "     
            "VST1.16     {d14[2]}, [r4]!; "
            "VST1.16     {d8[2]}, [r4]!; "

            "VST1.16     {d30[3]}, [r4]!; "     
            "VST1.16     {d14[3]}, [r4]!; "
            "VST1.16     {d8[3]}, [r4]!; "//response to array

            "subs        r5, r5, #1; "   // decrement: numofinteration -= 1;
            "bne        .loop; "        // Branch If Not Zero; to .loop
            "VST1.16     {d4[0]}, [r6]!; "//q2 refer to sum restore the final result to pointer out
            "VST1.16     {d4[1]}, [r6]!; "
            "VST1.16     {d4[2]}, [r6]!; "
            "VST1.16     {d4[3]}, [r6]!; "
            "VST1.16     {d5[0]}, [r6]!; "
            "VST1.16     {d5[1]}, [r6]!; "
            "VST1.16     {d5[2]}, [r6]!; "
            "VST1.16     {d5[3]}, [r6]!; "

            :"+r"(src1),"+r"(src2),"+r"(dstData),"+r"(dstSum),"+r"(height)
            :"r"(temp_comp1),"r"(numofiterations),"r"(temp_comp2),"r"(temp_comp3),
                "r"(array1),"r"(array2), "r"(width),"r"(step)
            :"cc", "memory","q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q10", "q11", "q12", "q13", "q14", "q15","r0", "r1", "r3", "r4", "r5","r6","r8", "r9", "r10", "r12"
            );
    }

I suspect the problem is in the output operand list or the input operand list. What causes the error in my code, and how can I solve it?


Solution

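  • You declare clobbers on most of the integer registers, but then you ask for 13 different operands (5 read/write plus 8 input-only), all of which need registers. 32-bit ARM has only 16 integer registers, and 2 of those are PC and SP, leaving at best 14 that are really general-purpose. Your clobber list reserves 10 of them (r0, r1, r3, r4, r5, r6, r8, r9, r10, r12), so at most 4 remain for 13 operands, and the compiler cannot satisfy the constraints.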

    We can confirm that too many clobbers plus operands are the problem by removing all the clobbers on r0..r12: that lets it compile (into incorrect code!). https://godbolt.org/z/Z6x78N This is not the solution, because it introduces huge bugs; it's just how I confirmed that this is the problem.

    Any time your inline asm template starts with mov to copy from an input register operand into a hard-coded register, you're usually doing it wrong. Even if you had enough registers, the compiler is going to have to emit code to get the variable into a register, then your hand-written asm uses another mov to copy it for no reason.

    See https://stackoverflow.com/tags/inline-assembly/info for more guides.

    Instead, ask the compiler for the input in that register in the first place with register int foo asm("r0"), or better, let the compiler do register allocation by using %0 or the equivalent named operand like %[src1] instead of a hard-coded r0 everywhere inside your asm template. The syntax for naming an operand is [name] "r" (C_var_name). The operand name and the C variable name don't have to match, but they don't have to differ either; it's often convenient to use the same asm operand name as the C var name.

    Then you can remove the clobbers on most of the GP registers. You do need to tell the compiler about any input registers you modify, e.g. by using a "+r" constraint instead of "r" (and then not using that C variable after the asm modifies it). Or use an "=r" output constraint and a matching input constraint like "0" (var) to put that input in the same register as output operand 0. "+r" is much easier in a wrapper function where the C variable is not used afterwards anyway.

    You can remove the clobbers on vector registers if you use dummy output operands to get the compiler to do register allocation, but it's basically fine if you just leave those hard-coded.

    asm(  // "mov        r0, %[src1]; "   // remove this and just use %[src1] instead of r0
    
          "... \n\t"
          "VST1.16     {d30[0]}, [%[dstData]]!   \n\t"  //restore img_temp[m][n] to pointer data
          "... \n\t"
    
        : [src1]"+&r"(src1), [src2]"+&r"(src2), [dstData]"+&r"(dstData),
          [dstSum]"+&r"(dstSum), [height]"+&r"(height)
    
        : [temp_comp1] "r"(temp_comp1),  [niter] "r"(numofiterations),
          [temp_comp2] "r"(temp_comp2), [temp_comp3] "r"(temp_comp3),
          ...
        : "memory", "cc", all the q and d regs you use.  // but not r0..r13
       );
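
For a complete, much smaller instance of the same pattern, here is a minimal sketch. The function `sum_bytes` and its operand names are hypothetical and not from the question. The point is only that every register is a named operand the compiler allocates, that values the asm modifies are declared "+r", and that no integer register appears in the clobber list. The asm path assumes ARM or Thumb-2 state; the other branch is a plain C fallback so the sketch also builds on non-ARM targets.

```c
#include <stddef.h>

/* Hypothetical helper (not from the question): sums n bytes starting at p. */
static unsigned sum_bytes(const unsigned char *p, size_t n)
{
    unsigned sum = 0;
    if (n == 0)
        return 0;               /* the asm loop below assumes n > 0 */
#if defined(__arm__)
    unsigned tmp;
    __asm__ volatile(
        "1:                              \n\t"
        "ldrb  %[tmp], [%[p]], #1        \n\t"  /* load a byte, post-increment p */
        "add   %[sum], %[sum], %[tmp]    \n\t"
        "subs  %[n], %[n], #1            \n\t"  /* sets flags, hence "cc" */
        "bne   1b                        \n\t"
        : [p] "+r"(p), [n] "+r"(n), [sum] "+r"(sum), [tmp] "=&r"(tmp)
        :                                        /* no input-only operands */
        : "cc", "memory");                       /* no r0..r12 clobbers needed */
#else
    while (n--)                 /* portable fallback for non-ARM builds */
        sum += *p++;
#endif
    return sum;
}
```

The numeric local label `1:` with `1b` is used instead of a named label like `.loop` so that the statement can be inlined more than once without defining the same symbol twice.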
    

    You can look at the compiler's asm output to see how it filled in the %0 and %[name] operands in the asm template you gave it. End each instruction with "\n\t", as in "instruction \n\t", to make this readable; a ; separator puts all the instructions onto the same line in the asm output, because C string-literal concatenation doesn't introduce newlines.
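
The concatenation behaviour can be checked in plain C, without any inline asm. In this small sketch the two template strings and the helper `count_newlines` are hypothetical names chosen for illustration:

```c
#include <stddef.h>

/* Two hypothetical templates, built the way an asm template is built:
   adjacent string literals are joined with nothing in between. */
static const char with_semicolons[] =
    "mov r0, r1; "
    "add r0, r0, #1; ";

static const char with_newlines[] =
    "mov r0, r1      \n\t"
    "add r0, r0, #1  \n\t";

/* Counts the line breaks in a template string. */
static size_t count_newlines(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (*s == '\n')
            ++n;
    return n;
}
```

Here `count_newlines(with_semicolons)` is 0, so the whole template lands on one line of the asm output, while `count_newlines(with_newlines)` is 2, one line per instruction.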

    The early-clobber declarations on the read/write operands make sure that none of the input-only operands share a register with them, even if the compiler knows that temp_comp1 == height, for example. The original value of temp_comp1 still needs to be readable from the register %[temp_comp1] even after something has modified %[height], so they can't both be r4, for example. Without the & in "+&r", the compiler could choose to share a register, which gains efficiency when outputs are only written after all inputs are read (e.g. when wrapping a single instruction, which is what GNU C inline asm is designed to do efficiently).
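
A minimal sketch of why the & matters, using a hypothetical helper `add_early_clobber` that is not part of the question: the template writes its output in the first instruction and only reads `b` in the second. Without the early-clobber marker the compiler would be allowed to put `out` and `b` in the same register, and the result would be 2*a instead of a + b. The non-ARM branch is a plain C fallback so the sketch builds elsewhere.

```c
/* Hypothetical helper (not from the question): returns a + b,
   writing the output before the last input has been read. */
static int add_early_clobber(int a, int b)
{
    int out;
#if defined(__arm__)
    __asm__("mov %[out], %[a]            \n\t"   /* out is written here...   */
            "add %[out], %[out], %[b]    \n\t"   /* ...but b is read later   */
            : [out] "=&r"(out)                   /* & keeps out apart from a and b */
            : [a] "r"(a), [b] "r"(b));
#else
    out = a + b;                /* portable fallback for non-ARM builds */
#endif
    return out;
}
```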


    Side note: array1[16] and array2[16] don't need to be volatile; the "memory" clobber on the asm statement is sufficient, even though you just pass pointers to them rather than using them as "m" input operands.
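
For comparison, the "m"-operand alternative can be sketched as follows. The helper `last_of_16` is hypothetical; it passes the pointer in a register and adds a dummy "m" input that is never referenced in the template and only tells the compiler which 16 bytes the asm reads, so no "memory" clobber is needed. The non-ARM branch is a plain C fallback.

```c
/* Hypothetical helper (not from the question): returns byte 15 of a 16-byte table. */
static unsigned last_of_16(const unsigned char *table)
{
    unsigned out;
#if defined(__arm__)
    __asm__("ldrb %[out], [%[p], #15]"
            : [out] "=r"(out)
            : [p] "r"(table),
              /* Dummy operand, unused in the template: it declares that
                 the asm reads the 16 bytes that table points to. */
              "m"(*(const unsigned char (*)[16])table));
#else
    out = table[15];            /* portable fallback for non-ARM builds */
#endif
    return out;
}
```

For a large hand-written loop like the one in the question, the blanket "memory" clobber is usually the simpler choice; the dummy operand is mainly useful when the asm touches a small, fixed-size region.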