c assembly compiler-optimization floating-accuracy powerpc

Assembly differences between unrolled for-loops cause differing float results

Consider the following setup:

typedef struct
{
    float d;
} InnerStruct;

typedef struct
{
    InnerStruct **c;
} OuterStruct;


float TestFunc(OuterStruct *b)
{
    float a = 0.0f;
    for (int i = 0; i < 8; i++)
        a += b->c[i]->d;
    return a;
}

The for loop in TestFunc exactly replicates one in another function that I'm testing. gcc (4.9.2) unrolls both loops, but the resulting assembly differs slightly between the two.

Assembly for my test loop:                  Assembly for the original loop:

lwz       r9,-0x725C(r13)                   lwz       r9,0x4(r3)    
lwz       r8,0x4(r9)                        lwz       r8,0x8(r9)    
lwz       r10,0x0(r9)                       lwz       r10,0x4(r9)   
lwz       r11,0x8(r9)                       lwz       r11,0x0C(r9)  
lwz       r4,0x4(r8)                        lwz       r3,0x4(r8)    
lwz       r10,0x4(r10)                      lwz       r10,0x4(r10)  
lwz       r8,0x4(r11)                       lwz       r0,0x4(r11)   
lwz       r11,0x0C(r9)                      lwz       r11,0x10(r9)  
efsadd    r4,r4,r10                         efsadd    r3,r3,r10
lwz       r10,0x10(r9)                      lwz       r8,0x14(r9)   
lwz       r7,0x4(r11)                       lwz       r10,0x4(r11)  
lwz       r11,0x14(r9)                      lwz       r11,0x18(r9)  
efsadd    r4,r4,r8                          efsadd    r3,r3,r0
lwz       r8,0x4(r10)                       lwz       r0,0x4(r8)    
lwz       r10,0x4(r11)                      lwz       r8,0x0(r9)    
lwz       r11,0x18(r9)                      lwz       r11,0x4(r11)  
efsadd    r4,r4,r7                          efsadd    r3,r3,r10
lwz       r9,0x1C(r9)                       lwz       r10,0x1C(r9)  
lwz       r11,0x4(r11)                      lwz       r9,0x4(r8)    
lwz       r9,0x4(r9)                        efsadd    r3,r3,r0
efsadd    r4,r4,r8                          lwz       r0,0x4(r10)   
efsadd    r4,r4,r10                         efsadd    r3,r3,r11
efsadd    r4,r4,r11                         efsadd    r3,r3,r9
efsadd    r4,r4,r9                          efsadd    r3,r3,r0

The problem is that the float values these two instruction sequences return are not exactly the same, and I can't change the original loop, so I need to modify the test loop somehow to return the same values. I believe the test's assembly is equivalent to just adding each element one after another. I'm not very familiar with assembly, so I'm not sure how the differences above translate into C. I know the unrolling is the cause because if I add a print to the loops, they are not unrolled and the results match exactly, as expected.
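One hedged reading of the two listings in C terms: assume r9 holds b->c, each `lwz rX,0xN(r9)` fetches c[N/4], and each `lwz rX,0x4(rY)` fetches the float member. That member sits at offset 4 in the listings, so the real struct presumably has a 4-byte member before d, unlike the simplified one above. Under that reading, the efsadd chain in the test loop adds the elements in index order (the first add has its operands swapped, which does not change the result), while the original loop appears to add them in the order 1, 2, 3, 4, 5, 6, 0, 7. Float addition is not associative, so the two orders can round differently. The sketch below spells out both orders. The function names and the volatile accumulator, used here to discourage the compiler from reordering the adds again, are illustrative choices and not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float d;
} InnerStruct;

typedef struct {
    InnerStruct **c;
} OuterStruct;

/* Adds c[0]..c[7] in index order, as the test loop's listing appears to do. */
float SumIndexOrder(const OuterStruct *b)
{
    assert(b != NULL && b->c != NULL);
    /* volatile: every partial sum is stored and reloaded, which pins the
       order of the additions (an illustrative choice, not from the post). */
    volatile float a = 0.0f;
    for (int i = 0; i < 8; i++)
        a += b->c[i]->d;
    return a;
}

/* Adds the elements in the order read off the original loop's listing:
   1, 2, 3, 4, 5, 6, 0, 7. This order is an interpretation of the assembly,
   not something confirmed by the original source code. */
float SumOriginalOrder(const OuterStruct *b)
{
    static const int order[8] = { 1, 2, 3, 4, 5, 6, 0, 7 };
    assert(b != NULL && b->c != NULL);
    volatile float a = b->c[order[0]]->d;
    for (int k = 1; k < 8; k++)
        a += b->c[order[k]]->d;
    return a;
}
```

For example, with c[0]->d set to 16777216.0f (2^24) and the other seven elements set to 1.0f, IEEE-754 single precision with round-to-nearest gives 16777216.0f for the first order and 16777224.0f for the second. Whether the second function matches the original loop bit for bit on the target still depends on the listing having been read correctly and on the build flags in use.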

Solution

Disabling fast-math seems to fix this issue. Thanks to @njuffa for the suggestion. I was hoping to design the test function around this optimization, but that doesn't seem to be possible. At least I now know what the cause is. I appreciate everyone's help with the problem!
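To keep the test from being built with fast-math again by accident, a guard is one option. GCC defines the `__FAST_MATH__` macro when -ffast-math is in effect, so the sketch below reports whether the current file was compiled that way. The names BuiltWithFastMath and RequireStrictFloatMath are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when this file was compiled with GCC's -ffast-math,
   detected through the predefined __FAST_MATH__ macro. */
bool BuiltWithFastMath(void)
{
#ifdef __FAST_MATH__
    return true;
#else
    return false;
#endif
}

/* Aborts (in builds where assert is active) if fast-math is enabled,
   so float comparisons are never run against a reordered sum. */
void RequireStrictFloatMath(void)
{
    assert(!BuiltWithFastMath());
}
```

A test harness could call RequireStrictFloatMath() once before comparing results. The same check could instead be written as an `#error` under `#ifdef __FAST_MATH__` to fail the build outright.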