Search code examples
arraysperformancef#breakcil

F# Breakable Array Iteration With Bounds Checking Elided?


I am interested in trying F# in a high-performance application. I do not want to have a large array's bounds checked during iteration and the lack of break/return statements is concerning.

This is a contrived example that will break upon finding a value, but can someone tell me if bounds checking is also elided?

let innerExists (item: Char) (items: Char array): bool = 
    let mutable state = false
    let mutable i = 0
    while not state && i < items.Length do
        state <- item = items.[i]
        i <- i + 1

    state

let exists (input: Char array)(illegalChars: Char array): bool = 
    let mutable state = false
    let mutable i = 0
    while not state && i < input.Length do
        state <- innerExists input.[i] illegalChars
        i <- i + 1

    state

exists [|'A'..'z'|] [|'.';',';';'|]

Here is the relevant disassembly:

    while not state && i < input.Length do
000007FE6EB4237A  cmp         dword ptr [rbp-14h],0  
000007FE6EB4237E  jne         000007FE6EB42383  
000007FE6EB42380  nop  
000007FE6EB42381  jmp         000007FE6EB42386  
000007FE6EB42383  nop  
000007FE6EB42384  jmp         000007FE6EB423A9  
000007FE6EB42386  nop  
000007FE6EB42387  mov         r8d,dword ptr [rbp-18h]  
000007FE6EB4238B  mov         rdx,qword ptr [rbp+18h]  
000007FE6EB4238F  cmp         r8d,dword ptr [rdx+8]  
000007FE6EB42393  setl        r8b  
000007FE6EB42397  movzx       r8d,r8b  
000007FE6EB4239B  mov         dword ptr [rbp-24h],r8d  
000007FE6EB4239F  mov         r8d,dword ptr [rbp-24h]  
000007FE6EB423A3  mov         dword ptr [rbp-1Ch],r8d  
000007FE6EB423A7  jmp         000007FE6EB423B1  
000007FE6EB423A9  nop  
000007FE6EB423AA  xor         r8d,r8d  
000007FE6EB423AD  mov         dword ptr [rbp-1Ch],r8d  
000007FE6EB423B1  cmp         dword ptr [rbp-1Ch],0  
000007FE6EB423B5  je          000007FE6EB42409  
            state <- innerExists input.[i] illegalChars
000007FE6EB423B7  mov         r8d,dword ptr [rbp-18h]  
000007FE6EB423BB  mov         rdx,qword ptr [rbp+18h]  
000007FE6EB423BF  cmp         r8,qword ptr [rdx+8]  
000007FE6EB423C3  jb          000007FE6EB423CA  
000007FE6EB423C5  call        000007FECD796850  
000007FE6EB423CA  lea         rdx,[rdx+r8*2+10h]  
000007FE6EB423CF  movzx       r8d,word ptr [rdx]  
000007FE6EB423D3  mov         rdx,qword ptr [rbp+10h]  
000007FE6EB423D7  mov         rdx,qword ptr [rdx+8]  
000007FE6EB423DB  mov         r9,qword ptr [rbp+20h]  
000007FE6EB423DF  mov         rcx,7FE6EEE0640h  
000007FE6EB423E9  call        000007FE6EB41E40  
000007FE6EB423EE  mov         dword ptr [rbp-20h],eax  
000007FE6EB423F1  mov         eax,dword ptr [rbp-20h]  
000007FE6EB423F4  movzx       eax,al  
000007FE6EB423F7  mov         dword ptr [rbp-14h],eax  
            i <- i + 1
000007FE6EB423FA  mov         eax,dword ptr [rbp-18h]  

Solution

  • Others pointed out to use existing function FSharp.Core to achieve the same result but I think that OP asks if in loops like the boundary check of arrays are elided (as it is checked in the loop condition).

    For simple code like above the jitter should be able to elide the checks. To see this it is correct to check the assembly code but it is important to not run with the VS debugger attached as the jitter don't optimize the code then. The reason that it can be impossible to show correct values in the debugger.

    First let's look at exists optimized x64:

    ; not state?
    00007ff9`1cd37551 85c0            test    eax,eax
    ; if state is true then exit the loop
    00007ff9`1cd37553 7521            jne     00007ff9`1cd37576
    ; i < input.Length?
    00007ff9`1cd37555 395e08          cmp     dword ptr [rsi+8],ebx
    ; Seems overly complex but perhaps this is as good as it gets?
    00007ff9`1cd37558 0f9fc1          setg    cl
    00007ff9`1cd3755b 0fb6c9          movzx   ecx,cl
    00007ff9`1cd3755e 85c9            test    ecx,ecx
    ; if we have reached end of the array then exit
    00007ff9`1cd37560 7414            je      00007ff9`1cd37576
    ; mov i in ebx to rcx, unnecessary but moves like these are very cheap
    00007ff9`1cd37562 4863cb          movsxd  rcx,ebx
    ; input.[i] (note we don't check the boundary again)
    00007ff9`1cd37565 0fb74c4e10      movzx   ecx,word ptr [rsi+rcx*2+10h]
    ; move illegalChars pointer to rdx
    00007ff9`1cd3756a 488bd7          mov     rdx,rdi
    ; call innerExists
    00007ff9`1cd3756d e8ee9affff      call    00007ff9`1cd31060
    ; i <- i + 1
    00007ff9`1cd37572 ffc3            inc     ebx
    ; Jump top of loop
    00007ff9`1cd37574 ebdb            jmp     00007ff9`1cd37551
    ; We are done!
    00007ff9`1cd37576
    

    So the code looks a bit too complex for what it should need to be but it seems it only checks the array condition once.

    Now let's look at innerExists optimized x64:

    # let mutable state = false
    00007ff9`1cd375a0 33c0            xor     eax,eax
    # let mutable i = 0
    00007ff9`1cd375a2 4533c0          xor     r8d,r8d
    ; not state?
    00007ff9`1cd375a5 85c0            test    eax,eax
    ; if state is true then exit the loop
    00007ff9`1cd375a7 752b            jne     00007ff9`1cd375d4
    ; i < items.Length
    00007ff9`1cd375a9 44394208        cmp     dword ptr [rdx+8],r8d
    ; Seems overly complex but perhaps this is as good as it gets?
    00007ff9`1cd375ad 410f9fc1        setg    r9b
    00007ff9`1cd375b1 450fb6c9        movzx   r9d,r9b
    00007ff9`1cd375b5 4585c9          test    r9d,r9d
    ; if we have reached end of the array then exit
    00007ff9`1cd375b8 741a            je      00007ff9`1cd375d4
    ; mov i in r8d to rax, unnecessary but moves like these are very cheap
    00007ff9`1cd375ba 4963c0          movsxd  rax,r8d
    ; items.[i] (note we don't check the boundary again)
    00007ff9`1cd375bd 0fb7444210      movzx   eax,word ptr [rdx+rax*2+10h]
    ; mov item in cx to r9d, unnecessary but moves like these are very cheap
    00007ff9`1cd375c2 440fb7c9        movzx   r9d,cx
    ; item = items.[i]?
    00007ff9`1cd375c6 413bc1          cmp     eax,r9d
    00007ff9`1cd375c9 0f94c0          sete    al
    ; state <- ?
    00007ff9`1cd375cc 0fb6c0          movzx   eax,al
    ; i <- i + 1
    00007ff9`1cd375cf 41ffc0          inc     r8d
    ; Jump top of loop
    00007ff9`1cd375d2 ebd1            jmp     00007ff9`1cd375a5
    ; We are done!
    00007ff9`1cd375d4 c3              ret
    

    So looks overly complex for what it should be but at least it looks like it only checks the array condition once.

    So finally, it looks like the jitter eliminates the array boundary checks because it can prove this has already been checked successfully in the loop condition which I believe is what the OP wondered.

    The x64 code doesn't look as clean as it could but from my experimentation x64 code that is cleaned up doesn't perform that much better, I suspect the CPU vendors optimize the CPU for the crappy code jitters produce.

    An interesting exercise would be to code up an equivalent program in C++ and run it through https://godbolt.org/, choose x86-64 gcc (trunk) (gcc seems to do best right now) and specify the options -O3 -march=native and see the resulting x64 code.

    Update

    The code rewritten in https://godbolt.org/ to allow us seeing the assembly code generated by a c++ compiler:

    template<int N>
    bool innerExists(char item, char const (&items)[N]) {
        for (auto i = 0; i < N; ++i) {
            if (item == items[i]) return true;
        }
        return false;
    }
    
    template<int N1, int N2>
    bool exists(char const (&input)[N1], char const (&illegalCharacters)[N2]) {
        for (auto i = 0; i < N1; ++i) {
            if (innerExists(input[i], illegalCharacters)) return true;
        }
        return false;
    }
    
    char const separators[] = { '.', ',', ';' };
    char const str[58] = {  };
    
    bool test() {
      return exists(str, separators);
    }
    

    x86-64 gcc (trunk) with options -O3 -march=native the following code is generated

    ; Load the string to test into edx
    mov edx, OFFSET FLAT:str+1
    .L2:
    ; Have we reached the end?
    cmp rdx, OFFSET FLAT:str+58
    ; If yes, then jump to the end
    je .L7
    ; Load a character
    movzx ecx, BYTE PTR [rdx]
    ; Comparing the 3 separators are encoded in the assembler
    ;  because the compiler detected the array is always the same
    mov eax, ecx
    and eax, -3
    cmp al, 44
    sete al
    cmp cl, 59
    sete cl
    ; increase outer i
    inc rdx
    ; Did we find a match?
    or al, cl
    ; If no then loop to .L2
    je .L2
    ; We are done!
    ret
    .L7:
    ; No match found, clear result
    xor eax, eax
    ; We are done!
    ret
    

    Looks pretty good but what I missing in the code above is using AVX to test multiple characters at once.