Tags: c, performance, trace, profiler, arm64

How to record a trace of assembly instructions and their corresponding timestamps for a C program on macOS?


I have the following C program:

#include <stdio.h>

int main() {
    float number1, number2, sum = 0.;
    number1 = .5;
    number2 = .3;
    while (sum > -10000000.)
        sum -= number1 + number2;
    printf("%f", sum);
    return 0;
}

Its corresponding assembly is as follows:

_main:                                  ; @main
    .cfi_startproc
; %bb.0:
    sub sp, sp, #16                     ; =16
    .cfi_def_cfa_offset 16
    str wzr, [sp, #12]
    str wzr, [sp]
    mov w8, #1056964608
    str w8, [sp, #8]
    mov w8, #39322
    movk    w8, #16025, lsl #16
    str w8, [sp, #4]
LBB0_1:                                 ; =>This Inner Loop Header: Depth=1
    ldr s0, [sp]
    fcvt    d0, s0
    adrp    x8, lCPI0_0@PAGE
    ldr d1, [x8, lCPI0_0@PAGEOFF]
    fcmp    d0, d1
    b.le    LBB0_3
; %bb.2:                                ;   in Loop: Header=BB0_1 Depth=1
    ldr s0, [sp, #8]
    ldr s1, [sp, #4]
    fadd    s1, s0, s1
    ldr s0, [sp]
    fsub    s0, s0, s1
    str s0, [sp]
    b   LBB0_1
LBB0_3:
    mov w0, #0
    add sp, sp, #16                     ; =16
    ret
    .cfi_endproc
                                        ; -- End function
.subsections_via_symbols

I want to analyse the latency of each instruction, so I'm looking for a way to obtain a program counter trace.

Desired output is as follows:

0000000000 _main:                                  ; @main
0000000001  .cfi_startproc
0000000002; %bb.0:
0000000003  sub sp, sp, #16                     ; =16
0000000004  .cfi_def_cfa_offset 16
0000000005  str wzr, [sp, #12]
0000000006  str wzr, [sp]
0000000007  mov w8, #1056964608
0000000008  str w8, [sp, #8]
0000000009  mov w8, #39322
0000000010  movk    w8, #16025, lsl #16
0000000011  str w8, [sp, #4]
...

where the first column is the timestamp, in pico-, nano-, or microseconds.

The target system is macOS, the compiler is LLVM (clang), and the debugger is LLDB.


Solution

  • There is no way to precisely measure instruction timings at a granularity of a few cycles (at least not on this target architecture), so you cannot measure the latency of one specific instruction unless it is a very slow one. The reason is that the best instructions available for reading the time are themselves fairly long, and the processor can execute multiple instructions per cycle, out of order (not to mention that they are pipelined). This is especially true of the M1 processor you appear to be running on. On ARM, the usual way to measure time is to read the PMCCNTR register, based on this post. Even with such an instruction you still need to account for superscalar out-of-order execution, and the delay of the instruction itself depends on the target architecture; AFAIK there is no official public information on this topic for the M1 (in fact, documentation on how the M1 executes instructions is scarce so far).
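    For what it is worth, below is a minimal sketch of coarse, region-level timing on Apple silicon, assuming clang inline assembly. It reads the generic timer (CNTVCT_EL0 and CNTFRQ_EL0), which is accessible from user space, unlike PMCCNTR, which macOS does not expose at EL0. The counter ticks at a low frequency (reportedly 24 MHz on the M1, i.e. roughly 41 ns per tick), so it can only time the whole loop, not individual instructions:

        #include <stdio.h>
        #include <stdint.h>

        /* Read the virtual counter of the ARM generic timer. The isb
           keeps the read from being reordered with earlier work. */
        static inline uint64_t read_cntvct(void) {
            uint64_t v;
            __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v));
            return v;
        }

        /* Frequency of the counter, in Hz. */
        static inline uint64_t read_cntfrq(void) {
            uint64_t f;
            __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
            return f;
        }

        int main(void) {
            float number1 = .5f, number2 = .3f, sum = 0.f;
            uint64_t t0 = read_cntvct();
            while (sum > -10000000.f)
                sum -= number1 + number2;
            uint64_t t1 = read_cntvct();
            /* Convert elapsed ticks to nanoseconds. */
            double ns = (double)(t1 - t0) * 1e9 / (double)read_cntfrq();
            printf("sum=%f  elapsed=%.0f ns\n", sum, ns);
            return 0;
        }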

    An alternative is to simulate the execution of the code with LLVM-MCA, which performs a static analysis of the program in order to model the scheduling of the instructions on the target architecture. Static analysis has a big downside: the actual runtime behaviour of loops and conditional branches is not taken into account.
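    As a sketch of how this might be invoked, one can pipe the compiler's assembly output straight into llvm-mca; the -mcpu value below is an assumption and requires an LLVM build that ships an Apple M1 scheduling model (otherwise pick the closest supported CPU):

        clang -O2 -S -o - main.c \
          | llvm-mca -mtriple=arm64-apple-macos -mcpu=apple-m1 -timeline

    The -timeline view prints, for each simulated instruction, the cycle in which it was dispatched, executed and retired, which is about as close to the per-instruction timestamps of the question as this platform allows.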

    Note that profiling non-optimized code is generally a huge waste of time, as it does not reflect the actual execution of the release version (which should be optimized). Once optimized, this code is likely bound by the dependency chain on sum. This is especially true on the M1 processor, whose big/performance cores can execute many instructions in parallel.
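    To observe that dependency chain rather than individual instructions, one can amortize the timer over a fixed number of iterations and divide. The sketch below reuses the hypothetical read_cntvct()/read_cntfrq() helpers from above; the volatile sink keeps the optimizer from deleting the loop, and without -ffast-math the compiler cannot reassociate the serial floating-point chain:

        /* Rough per-iteration cost of the serial fsub chain on sum. */
        volatile float sink;

        double ns_per_iteration(void) {
            const long n = 100000000L;   /* amortizes timer overhead */
            float step = .5f + .3f, sum = 0.f;
            uint64_t t0 = read_cntvct();
            for (long i = 0; i < n; i++)
                sum -= step;             /* each fsub depends on the last */
            uint64_t t1 = read_cntvct();
            sink = sum;                  /* observable side effect */
            return (double)(t1 - t0) * 1e9
                 / (double)read_cntfrq() / (double)n;
        }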