gcc arm inline assembler %e0 and %f0 operand modifiers for 16-byte NEON operands?

I found the following inline assembler code that calculates the vector cross product:

float32x4_t cross_test( const float32x4_t& lhs, const float32x4_t& rhs )
{
    float32x4_t result;

    asm volatile(
    "vext.8 d6, %e2, %f2, #4 \n\t"          
    "vext.8 d7, %e1, %f1, #4 \n\t"  
    "vmul.f32 %e0, %f1, %e2  \n\t" 
    "vmul.f32 %f0, %e1, d6   \n\t" 
    "vmls.f32 %e0, %f2, %e1  \n\t" 
    "vmls.f32 %f0, %e2, d7   \n\t" 
    "vext.8 %e0, %f0, %e0, #4    "      
    : "+w" ( result )                  
    : "w" ( lhs ), "w" ( rhs )            
    : "d6", "d7" );

    return result;
}

What do the modifiers e and f after '%' mean (e.g. %e2)? I cannot find any reference for them.

This is the assembler code generated by gcc:

vext.8 d6, d20, d21, #4 
vext.8 d7, d18, d19, #4 
vmul.f32 d16, d19, d20  
vmul.f32 d17, d18, d6   
vmls.f32 d16, d21, d18  
vmls.f32 d17, d20, d7   
vext.8 d16, d17, d16, #4

I now understand the meaning of the modifiers. Next I tried to follow the cross product algorithm. For this I added some comments to the assembler code, but the result does not match my expectation:

    // History:
    // - '%e'  = lower register part
    // - '%f'  = higher register part
    // - '%?0' = res = [ x2 y2 | z2 v2 ]
    // - '%?1' = lhs = [ x0 y0 | z0 v0 ]
    // - '%?2' = rhs = [ x1 y1 | z1 v1 ]
    // - '%e0'       = [ x2 y2 ]
    // - '%f0'       = [ z2 v2 ]
    // - '%e1'       = [ x0 y0 ]
    // - '%f1'       = [ z0 v0 ]
    // - '%e2'       = [ x1 y1 ]
    // - '%f2'       = [ z1 v1 ]
    // Implemented algorithm:
    // |x2|   |y0 * z1 - z0 * y1|
    // |y2| = |z0 * x1 - x0 * z1|
    // |z2|   |x0 * y1 - y0 * x1|
    asm (
    "vext.8 d6, %e2, %f2, #4 \n\t" // e2=[ x1 y1 ], f2=[ z1 v1 ] -> d6=[ v1 x1 ]
    "vext.8 d7, %e1, %f1, #4 \n\t" // e1=[ x0 y0 ], f1=[ z0 v0 ] -> d7=[ v0 x0 ]
    "vmul.f32 %e0, %f1, %e2  \n\t" // f1=[ z0 v0 ], e2=[ x1 y1 ] -> e0=[ z0 * x1, v0 * y1 ]
    "vmul.f32 %f0, %e1, d6   \n\t" // e1=[ x0 y0 ], d6=[ v1 x1 ] -> f0=[ x0 * v1, y0 * x1 ]
    "vmls.f32 %e0, %f2, %e1  \n\t" // f2=[ z1 v1 ], e1=[ x0 y0 ] -> e0=[ z0 * x1 - z1 * x0, v0 * y1 - v1 * y0 ] = [ y2, - ]
    "vmls.f32 %f0, %e2, d7   \n\t" // e2=[ x1 y1 ], d7=[ v0 x0 ] -> f0=[ x0 * v1 - x1 * v0, y0 * x1 - y1 * x0 ] = [  -, - ]
    "vext.8 %e0, %f0, %e0, #4    " // 
    : "+w" ( result )              // Output section: 'w'='VFP floating point register', '+'='read/write'
    : "w" ( lhs ), "w" ( rhs )     // Input section : 'w'='VFP floating point register'
    : "d6", "d7" );                // Temporary 64[bit] register.

Solution

First of all, this code is weird. result isn't initialized before the asm statement, yet it is declared as a read/write operand with "+w" ( result ), so the asm formally reads an uninitialized value. An output-only constraint would be better: "=w" (result), or more precisely "=&w" (result), because the asm writes %e0 before it has finished reading operands 1 and 2, so the output must not share a register with an input. It also makes no sense for this to be volatile: the output is a pure function of the inputs, with no side effects and no dependency on any "hidden" inputs, so the same inputs give the same result every time. Omitting volatile lets the compiler CSE the statement and hoist it out of loops where possible, instead of forcing it to recompute the result every time the source runs it with the same inputs.
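As a rough sketch of those constraint changes (not tested on real hardware), the function could look something like the following. The Vec4 wrapper, the name cross_fixed, and the scalar fallback for non-ARM32 builds are my own additions for illustration; only the instruction sequence comes from the question. The early-clobber "=&w" is my reading of what the asm needs, since it writes %e0 before it has read all of its inputs.

```cpp
#include <cassert>
#if defined(__arm__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical plain wrapper so the example also compiles off-target.
struct Vec4 { float v[4]; };

Vec4 cross_fixed(const Vec4& lhs, const Vec4& rhs)
{
    Vec4 out;
#if defined(__arm__) && defined(__ARM_NEON)
    float32x4_t a = vld1q_f32(lhs.v);
    float32x4_t b = vld1q_f32(rhs.v);
    float32x4_t r;

    asm(                                    // no volatile: pure function of its inputs
        "vext.8   d6, %e2, %f2, #4 \n\t"
        "vext.8   d7, %e1, %f1, #4 \n\t"
        "vmul.f32 %e0, %f1, %e2    \n\t"
        "vmul.f32 %f0, %e1, d6     \n\t"
        "vmls.f32 %e0, %f2, %e1    \n\t"
        "vmls.f32 %f0, %e2, d7     \n\t"
        "vext.8   %e0, %f0, %e0, #4"
        : "=&w"(r)                          // write-only, early-clobber output
        : "w"(a), "w"(b)
        : "d6", "d7");

    vst1q_f32(out.v, r);
#else
    // Scalar fallback (assumption: used wherever ARM32 NEON is unavailable).
    out.v[0] = lhs.v[1] * rhs.v[2] - lhs.v[2] * rhs.v[1];
    out.v[1] = lhs.v[2] * rhs.v[0] - lhs.v[0] * rhs.v[2];
    out.v[2] = lhs.v[0] * rhs.v[1] - lhs.v[1] * rhs.v[0];
    out.v[3] = out.v[0];                    // the asm leaves a copy of lane 0 here
#endif
    return out;
}
```

Only the first three lanes of the result are meaningful; the fourth is a by-product of the shuffles.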

I couldn't find any reference either; the gcc manual's Extended ASM page only documents operand modifiers for x86, not ARM.

But I think we can see what the operand modifiers do from looking at the asm output:

%e0 is substituted with d16 and %f0 with d17. %e1 is d18 and %f1 is d19. Operand 2 is in d20 (%e2) and d21 (%f2).

Your inputs are 16-byte NEON vectors, in q registers. In ARM32, the upper and lower half of each q register is separately accessible as a d register. (Unlike AArch64 where each s / d register is the bottom element of a different q reg.) It looks like this code is taking advantage of this to shuffle for free by using 64-bit SIMD on the high and low pair of floats, after doing a 4-byte vext shuffle to mix those pairs of floats.
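To follow the data flow by hand, here is a small portable model of the instruction sequence in plain C++ (no NEON needed). The types D and Q and the helpers vext4, vmul and vmls are names I made up for this sketch; they only mimic what the instructions do on pairs of floats. Note that vext.8 with #4 takes the upper float of its first source and the lower float of its second, so d6 ends up as [ y1 z1 ], not [ v1 x1 ] as in the comments in the question.

```cpp
#include <array>
#include <cassert>

using D = std::array<float, 2>;  // models one d register: two float lanes
struct Q { D lo, hi; };          // models a q register: lo is %e, hi is %f

// vext.8 Dd, Dn, Dm, #4: bytes 4..11 of the pair (Dn, Dm), i.e. [ n[1], m[0] ].
D vext4(D n, D m) { return { n[1], m[0] }; }

// vmul.f32 Dd, Dn, Dm: lane-wise product.
D vmul(D n, D m) { return { n[0] * m[0], n[1] * m[1] }; }

// vmls.f32 Dd, Dn, Dm: lane-wise Dd = Dd - Dn * Dm.
D vmls(D acc, D n, D m) { return { acc[0] - n[0] * m[0], acc[1] - n[1] * m[1] }; }

// lhs = [ x0 y0 | z0 v0 ], rhs = [ x1 y1 | z1 v1 ]
Q cross_model(Q lhs, Q rhs)
{
    Q res{};
    D d6 = vext4(rhs.lo, rhs.hi);           // [ y1, z1 ]
    D d7 = vext4(lhs.lo, lhs.hi);           // [ y0, z0 ]
    res.lo = vmul(lhs.hi, rhs.lo);          // [ z0*x1, v0*y1 ]
    res.hi = vmul(lhs.lo, d6);              // [ x0*y1, y0*z1 ]
    res.lo = vmls(res.lo, rhs.hi, lhs.lo);  // [ z0*x1 - z1*x0, v0*y1 - v1*y0 ]
    res.hi = vmls(res.hi, rhs.lo, d7);      // [ x0*y1 - x1*y0, y0*z1 - y1*z0 ]
    res.lo = vext4(res.hi, res.lo);         // [ y0*z1 - y1*z0, z0*x1 - z1*x0 ]
    return res;                             // [ x2 y2 | z2, copy of x2 ]
}
```

In this model the final vext.8 is what moves x2 (computed in the upper lane of %f0) and y2 (computed in the lower lane of %e0) into the right places; z2 was already sitting in the lower lane of %f0.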

%e[operand] is the low d register of an operand, %f[operand] is the high d register. They're not documented, but the gcc source code says (in arm_print_operand in gcc/config/arm/arm.c#L22486):

These two codes print the low/high doubleword register of a Neon quad register, respectively. For pair-structure types, can also print low/high quadword registers.

I didn't test what happens if you apply these modifiers to 64-bit operands like float32x2_t, and this is all just me reverse-engineering from one example. But it makes perfect sense that there would be modifiers for this.

x86 modifiers include ones for the low and high 8 bits of integer registers (so you can get AL / AH if your input is in EAX), so partial-register stuff is definitely something that GNU C inline asm operand modifiers can do.
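For comparison, here is a minimal sketch of that x86 case, using the documented b and h modifiers together with the "Q" constraint (a register that has a high-byte form: a, b, c or d). The function names and the portable fallback are mine, and the asm assumes GCC or clang with the default AT&T syntax.

```cpp
#include <cassert>

// Returns bits 0..7 of x.
unsigned low_byte(unsigned x)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned r;
    asm("movzbl %b1, %0" : "=Q"(r) : "Q"(x));  // %b1 prints e.g. %al
    return r;
#else
    return x & 0xFFu;                          // fallback for other targets
#endif
}

// Returns bits 8..15 of x.
unsigned high_byte(unsigned x)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned r;
    asm("movzbl %h1, %0" : "=Q"(r) : "Q"(x));  // %h1 prints e.g. %ah
    return r;
#else
    return (x >> 8) & 0xFFu;                   // fallback for other targets
#endif
}
```

The output also uses "Q" here because a high-byte register such as AH cannot be encoded in the same instruction as a register that needs a REX prefix.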

Beware that undocumented means unsupported.