delphi x86-64 micro-optimization iaca

Using IACA with non-assembly routine


I've been playing around with IACA (Intel's static code analyser).
It works fine when testing with assembly snippets where I can input the magic marker bytes manually, like this:

procedure TSlice.BitSwap(a, b: integer);
asm
  //RCX = self
  //edx = a
  //r8d = b

  mov ebx, 111      // Start IACA marker bytes
  db $64, $67, $90  // Start IACA marker bytes

  xor eax, eax
  xor r10d, r10d

  mov r9d, [rcx]  // read the value
  mov ecx,edx     // need a in cl for the shift
  btr r9d, edx    // read and clear the a bit

  setc al         // convert cf to bit
  shl eax, cl     // shift bit to ecx position

  btr r9d, r8d    // read and clear the b bit

  mov ecx, r8d    // need b in ecx for shift
  setc r10b       // convert cf to bit
  shl r10d, cl    // shift bit to edx position

  or r9d, eax     // copy in old edx bit
  or r9d, r10d    // copy in old ecx bit

  mov [r8], r9d   // store result
  ret

  mov ebx, 222      // End IACA marker bytes
  db $64, $67, $90  // End IACA marker bytes
end;

Is there a way to prefix/suffix non-assembly code with the required magic markers so that I can analyse the compiler-generated code?

I know I can copy-paste the generated assembly from the CPU view and create a routine from that, but I was hoping there is an easier workflow.

EDIT
I'm looking for solutions that work in the 64-bit compiler. I know I can mix assembly and normal code in the 32-bit compiler.

UPDATE
@Dsm's suggestion works. @Rudy's trick does not.

With dummy code wrapped this way, IACA produces the following report:

Throughput Analysis Report
--------------------------
Block Throughput: 13.33 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.3    0.0  | 1.4  | 1.0    1.0  | 1.0    1.0  | 0.0  | 1.4  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   3^   | 0.3       | 0.3 | 1.0   1.0 |           |     | 0.3 | 1.0 |     | CP | ret
|   X    |           |     |           |           |     |     |     |     |    | int3
[... more int3's]
|   X    |           |     |           |           |     |     |     |     |    | int3
|   1    | 1.0       |     |           |           |     |     |     |     |    | shl eax, 0x10
|   1    |           | 0.6 |           |           |     | 0.3 |     |     |    | cmp eax, 0x64
|   3^   |           | 0.3 |           | 1.0   1.0 |     | 0.6 | 1.0 |     | CP | ret
|   X    |           |     |           |           |     |     |     |     |    | int3
|   X    |           |     |           |           |     |     |     |     |    | int3
[...]
Total Num Of Uops: 8

UPDATE 2
If there is a call statement in the marked code, IACA seems to bomb and refuses to analyse it, complaining about illegal instructions. However, the basic idea works. Obviously you need to subtract the initial ret and its associated cost (3 of the 8 uops in the report above).
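
Since IACA analyses every byte between the two markers (hence the leading ret and the int3 padding in the report), it can help to confirm where the markers ended up in the compiled binary. Below is a minimal, untested sketch of a console program that searches a file for the two marker byte sequences. `FindIacaMarkers` and `FindBytes` are hypothetical names, and the byte patterns assume the assembler encodes `mov ebx, imm32` as opcode `$BB` followed by the little-endian immediate.

```delphi
program FindIacaMarkers;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

const
  // Assumed encoding: mov ebx, imm32 = $BB + 4-byte little-endian immediate,
  // followed by the $64, $67, $90 bytes from the db line.
  StartMarker: array[0..7] of Byte = ($BB, $6F, $00, $00, $00, $64, $67, $90); // 111 = $6F
  EndMarker: array[0..7] of Byte = ($BB, $DE, $00, $00, $00, $64, $67, $90);   // 222 = $DE

// Returns the offset of the first match at or after StartAt, or -1 if none.
function FindBytes(const Data: TBytes; const Pattern: array of Byte;
  StartAt: Integer): Integer;
var
  I, J: Integer;
begin
  for I := StartAt to Length(Data) - Length(Pattern) do
  begin
    J := 0;
    while (J < Length(Pattern)) and (Data[I + J] = Pattern[J]) do
      Inc(J);
    if J = Length(Pattern) then
      Exit(I);
  end;
  Result := -1;
end;

var
  Data: TBytes;
  StartPos, EndPos: Integer;
begin
  if ParamCount < 1 then
  begin
    Writeln('Usage: FindIacaMarkers <binary>');
    Halt(1);
  end;
  Data := TFile.ReadAllBytes(ParamStr(1));
  StartPos := FindBytes(Data, StartMarker, 0);
  if StartPos < 0 then
  begin
    Writeln('Start marker not found');
    Halt(1);
  end;
  // Only look for the end marker after the start marker.
  EndPos := FindBytes(Data, EndMarker, StartPos + Length(StartMarker));
  if EndPos < 0 then
    Writeln('End marker not found after start marker')
  else
    Writeln(Format('Start at $%x, end at $%x, %d bytes in between',
      [StartPos, EndPos, EndPos - StartPos - Length(StartMarker)]));
end.
```

If the end marker is reported as missing after the start marker, the routines were probably not laid out in the order the trick relies on.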


Solution

  • I don't use IACA so I can't test this idea, and I will delete the answer if it does not work, but can you not just do something like this:

    procedure TForm10.Button1Click(Sender: TObject);
    begin
      asm
        //RCX = self
        //edx = a
        //r8d = b
    
        mov ebx, 111      // Start IACA marker bytes
        db $64, $67, $90  // Start IACA marker bytes
      end;
    
      fRotate( fLine - Point(0,1), 23 );
    
      asm
        mov ebx, 222      // End IACA marker bytes
        db $64, $67, $90  // End IACA marker bytes
    
      end;
    end;
    

    This was just a sample routine from something else to check that it compiles, which it does.

    Sadly this only works for 32-bit: as Johan points out, the 64-bit compiler does not allow asm blocks inside a begin..end body.

    For 64-bit the following may work, but again I cannot test it.

    procedure TForm10.Button1Click(Sender: TObject);
      procedure Test1;
      asm
        //RCX = self
        //edx = a
        //r8d = b
    
        mov ebx, 111      // Start IACA marker bytes
        db $64, $67, $90  // Start IACA marker bytes
      end;
      procedure Test2;
      begin
        fRotate( fLine - Point(0,1), 23 );
      end;
      procedure Test3;
      asm
        mov ebx, 222      // End IACA marker bytes
        db $64, $67, $90  // End IACA marker bytes
    
      end;
    begin
      Test1;
      Test2;
      Test3;
    end;
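
    The same idea can be sketched as a standalone console program, which keeps the form and its event handler out of the picture. This is an untested sketch under assumptions: `IacaStart`, `CodeUnderTest` and `IacaEnd` are hypothetical names, the body of `CodeUnderTest` is a placeholder, and it assumes the compiler emits the three routines in source order so that the code under test ends up between the markers.

    ```delphi
    program IacaDemo;

    {$APPTYPE CONSOLE}

    // Pure asm routine: allowed by the 64-bit compiler, unlike an asm block
    // inside a begin..end body.
    procedure IacaStart;
    asm
      mov ebx, 111      // Start IACA marker bytes
      db $64, $67, $90  // Start IACA marker bytes
    end;

    // Placeholder for the compiler-generated code to be analysed.
    function CodeUnderTest(Value: Integer): Integer;
    begin
      Result := Value shl 16;
    end;

    procedure IacaEnd;
    asm
      mov ebx, 222      // End IACA marker bytes
      db $64, $67, $90  // End IACA marker bytes
    end;

    begin
      // All three routines are referenced so the linker has no reason to drop them.
      IacaStart;
      Writeln(CodeUnderTest(1));
      IacaEnd;
    end.
    ```

    The program only needs to be built and passed to IACA. Running it executes the `mov ebx, ...` instructions, which overwrite a register that the Win64 calling convention treats as non-volatile.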