c gcc gdb trace

Tracing hardware related numerical differences down to the instruction


I have compiled a numerical simulation model from C into an ELF binary with GCC (a shared object with the file extension .mexa64, because it is loaded into MATLAB). No debug or optimization flags were used. The model uses only standard math library functions, with no BLAS or LAPACK calls. The compiler information is as follows:

> objdump -s --section .comment my.mexa64

my.mexa64:     file format elf64-x86-64  

Contents of section .comment:
 0000 4743433a 20285562 756e7475 20392e35  GCC: (Ubuntu 9.5
 0010 2e302d31 7562756e 7475317e 32322e30  .0-1ubuntu1~22.0
 0020 34292039 2e352e30 00                 4) 9.5.0.

Now, this binary seems to produce different results when run on an AMD 6850U than on an Intel Xeon Gold 6142 CPU.

In both cases I used the same Docker image to run the simulation, so there should be no software-related influences on the result. Colleagues can also replicate these differences, but only on some of their laptops (current hypothesis: these also have different CPUs). I assume the cause is a hardware difference.

In a first attempt at tracing the problem, I printed all the simulation inputs and outputs to the shell and diffed them. Only the outputs differ. The differences in the outputs are on the order of $10^{-12}$. In our case this is enough that, over time, the algorithm that depends on the outputs of the model produces significantly different results.

In the next step I would like to trace the registers during the simulation and log that to file. From the traces, I would like to learn which operation makes the simulations diverge. How can I use tools like gdb to trace the differences down to the first instruction that affects it?


Solution

  • The differences in the outputs are on the order of $10^{-12}$. In our case this is enough that, over time, the algorithm that depends on the outputs of the model produces significantly different results.

    If your algorithm is this sensitive to numerical differences on the order of $10^{-12}$, then it is numerically unstable and you can't trust any of its results.


    How can I use tools like gdb to trace the differences down to the first instruction that affects it?

    You could certainly use GDB to do this if the program is single-threaded:

    1. Disable ASLR
    2. In GDB, single-step the program
    3. Use the GDB gcore command to save a complete memory image into core.$count
    4. Increment $count
    5. Go to step 2

    Comparing core.$count from the two machines should give you the $count at which they differ (a minor complication is that there could be differences due to date/time, but it should be relatively easy to ignore these).
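
The comparison of two saved images can be sketched in C as follows. This is a minimal sketch and not part of the approach as originally described: `first_diff_offset` is a hypothetical helper name, and it assumes the files from both machines have been copied to one place. It reports the offset of the first differing byte, which is also what the standard `cmp` utility does.

```c
#include <stdio.h>

/* Hypothetical helper: return the offset of the first differing byte
 * between two files, -1 if they are identical, or -2 if either file
 * cannot be opened. If one file is a prefix of the other, the length
 * of the shorter file is returned. */
long long first_diff_offset(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    long long offset = 0;
    long long result = -1;

    if (!a || !b) {
        if (a) fclose(a);
        if (b) fclose(b);
        return -2;
    }
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca != cb) {       /* also true when only one file has ended */
            result = offset;
            break;
        }
        if (ca == EOF)        /* both files ended together: identical */
            break;
        offset++;
    }
    fclose(a);
    fclose(b);
    return result;
}
```

Calling this for each pair of images in increasing $count order and stopping at the first result other than -1 would give the step at which the two runs start to differ. Differences that come from date/time values would still have to be filtered out by hand.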

    The problem with this approach is that you are going to need terabytes to store all these cores, and it will probably take years to single-step a non-trivial program.


    A better approach is to use a reversible debugger, such as rr.

    You can run the program under rr to completion on both machines. Now go back in execution to the point where a variable with a different value was last changed. The instruction there either had identical inputs (in which case you have found one instruction that produces a numerical difference), or it had different inputs (in which case you now need to trace back to where these inputs came from).

    This wouldn't necessarily give you the first instruction with numerical differences, and this approach is labor-intensive, but will probably not take years to complete.


    An even better approach is to modify your program so it regularly saves all relevant state into a series of snapshots (e.g. every loop iteration).
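
A snapshot writer could look roughly like the following. This is a sketch under assumptions: the state is taken to be a flat array of doubles, and the record layout (iteration number, element count, raw values) and the name `snapshot_append` are made up for illustration. Raw bytes are written instead of formatted decimals so that a later comparison is bit-exact.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout: uint64 iteration, uint64 count,
 * then count doubles as raw bytes. Returns 0 on success, -1 on
 * a write error. */
int snapshot_append(FILE *out, uint64_t iteration,
                    const double *state, uint64_t count)
{
    if (fwrite(&iteration, sizeof iteration, 1, out) != 1)
        return -1;
    if (fwrite(&count, sizeof count, 1, out) != 1)
        return -1;
    if (fwrite(state, sizeof *state, (size_t)count, out) != (size_t)count)
        return -1;
    return 0;
}
```

The model's main loop would open one file in "wb" mode at start-up and call `snapshot_append` once per iteration. Both machines are x86-64, so byte order and the size of a double match and the files can be compared directly.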

    You can then find the first snapshot with a difference between the two machines, and use divide and conquer to bisect the difference to a small part of your program.
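
Finding the first differing snapshot can be sketched as below. The code assumes a hypothetical file layout in which each record holds a uint64 iteration number, a uint64 element count, and then that many raw doubles; the names `snapshot_diff` and `snapshots_first_diff` are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int status;          /* 1: difference found, 0: files match, -1: I/O or format error */
    uint64_t iteration;  /* iteration number of the first differing record */
    uint64_t index;      /* index of the first differing element in that record */
} snapshot_diff;

snapshot_diff snapshots_first_diff(const char *path_a, const char *path_b)
{
    snapshot_diff r = {0, 0, 0};
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");

    if (!a || !b) {
        r.status = -1;
        goto done;
    }
    for (;;) {
        uint64_t ha[2], hb[2];  /* record headers: iteration, count */
        size_t na = fread(ha, sizeof ha[0], 2, a);
        size_t nb = fread(hb, sizeof hb[0], 2, b);

        if (na == 0 && nb == 0)
            break;              /* both files ended: no difference */
        if (na != 2 || nb != 2 || ha[0] != hb[0] || ha[1] != hb[1]) {
            r.status = -1;      /* truncated file or records out of step */
            break;
        }
        for (uint64_t i = 0; i < ha[1]; i++) {
            double xa, xb;
            if (fread(&xa, sizeof xa, 1, a) != 1 ||
                fread(&xb, sizeof xb, 1, b) != 1) {
                r.status = -1;
                goto done;
            }
            if (memcmp(&xa, &xb, sizeof xa) != 0) {  /* bitwise comparison */
                r.status = 1;
                r.iteration = ha[0];
                r.index = i;
                goto done;
            }
        }
    }
done:
    if (a) fclose(a);
    if (b) fclose(b);
    return r;
}
```

The reported iteration bounds the search in time, and the element index hints at which part of the model wrote the value. Adding finer-grained snapshots inside that one iteration and repeating the comparison narrows the difference further.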

    Once you have narrowed it down to a small enough part, you can either disassemble that part and eyeball it, or use rr to find the exact instruction that produces a different result.


    P.S. Comments on this question are probably relevant here.