Search code examples
javascriptperformanceloopsv8spidermonkey

Javascript Performance: Modulus operation of negative Number within decrementing loop slowing the code by more than 100%


I was going through Eloquent JavaScript (again) and came across exercise "Chess Board" of Chapter 2. I had my one decent version of solution written back in the day when I was first time reading it, and another version of solution provided at the Elequent Javascript website. I'm one of the newbies that wanna be super-efficient programmers with only one question in their head: "Can I make it work a tad faster or smaller in anyway?"

So, during my search on the web few months ago, I came across a question on Stack Overflow, regarding for loop vs while loop on basis of performance. Since in that thread it was mentioned for loops are slower than while and loops with decrementing iterator are faster so I rewrote the code for better performance.

Here's the new version with for replaced with while and conditions edited for decrementing:

console.time("looping");
var gridSize = 5000, str = '', i = gridSize, j;
while (i--) {
  j = gridSize;
  while (j--) {
    if ((i - j) % 2 === 0)
      str += " ";
    else
      str += "#";
  }
  str += "\n";
}

//console.log(str);
console.timeEnd("looping");

But to my surprise this code had even worse performance. Then, after a while I found out that if ((i - j) % 2 === 0) was the main culprit, replacing minus sign with plus reduced the total time of execution to ~ 750ms

//Checked on NODE.js = v6.11.2
Book version of code         -->    893.76 ms
while loop with subtraction  -->    1562.43 ms
while loop with addition     -->    749.62 ms


//firefox Benchmark v54.0 (64-bit) (OS Ubuntu 16.04)
Book version of code         -->    725.10 ms
while loop with subtraction  -->    1565.29 ms
while loop with addition     -->    601.12 ms

Why is subtraction having such huge impact on total execution time?

Edit 01 6:30 AM (GMT) 8th August

After looking at @jaromandaX answer I'm Pretty sure that it is not the subtracting thats slowing down this loop, its the modulo of negative Number. Again I wanna know what makes modulo of negative number slower


Solution

  • This is far from a full answer and requires further investigation (or insights from someone who knows details of V8 implementation). Still, here are my findings:

    Sidenode: results were gathered running Node.JS using following command line:

    node --expose-gc --print-code --code-comments --print-opt-code --trace-hydrogen --redirect-code-traces --redirect-code-traces-to=code.asm --trace_representation --trace-deopt --trace-opt 1.js

    and a bit of looking into the V8 source code.

    1. Performance difference comes from the fact that different machine code is generated in those case. For + the code for % is

                      ;;; <@134,#123> add-i
    00000151A32DD74B   395  03c2           addl rax,rdx
    00000151A32DD74D   397  0f807a030000   jo 1293  (00000151A32DDACD)
                      ;;; <@136,#126> mod-by-power-of-2-i
    00000151A32DD753   403  85c0           testl rax,rax
    00000151A32DD755   405  790f           jns 422  (00000151A32DD766)
    00000151A32DD757   407  f7d8           negl rax
    00000151A32DD759   409  83e001         andl rax,0x1
    00000151A32DD75C   412  f7d8           negl rax
    00000151A32DD75E   414  0f846e030000   jz 1298  (00000151A32DDAD2)
    00000151A32DD764   420  eb03           jmp 425  (00000151A32DD769)
    00000151A32DD766   422  83e001         andl rax,0x1
                      ;;; <@138,#200> smi-tag
    00000151A32DD769   425  8bd8           movl rbx,rax
    00000151A32DD76B   427  48c1e320       REX.W shlq rbx, 32
                      ;;; <@140,#130> gap
    00000151A32DD76F   431  488bc2         REX.W movq rax,rdx
    

    while for - the code is much more complicated:

                      ;;; <@136,#123> sub-i
    00000151A32E57E1   417  412bc3         subl rax,r11
    00000151A32E57E4   420  0f8039040000   jo 1507  (00000151A32E5C23)
                      ;;; <@138,#200> int32-to-double
    00000151A32E57EA   426  c5f957c0       vxorpd xmm0,xmm0,xmm0
    00000151A32E57EE   430  c5fb2ac0       vcvtlsi2sd xmm0,xmm0,rax
                      ;;; <@139,#200> gap
    00000151A32E57F2   434  c5f928ca       vmovapd xmm1,xmm2
                      ;;; <@140,#126> mod-d
    00000151A32E57F6   438  4989e2         REX.W movq r10,rsp
    00000151A32E57F9   441  4883ec28       REX.W subq rsp,0x28
    00000151A32E57FD   445  4883e4f0       REX.W andq rsp,0xf0
    00000151A32E5801   449  4c89542420     REX.W movq [rsp+0x20],r10
    00000151A32E5806   454  48b830d4124001000000 REX.W movq rax,000000014012D430
    00000151A32E5810   464  ffd0           call rax
    00000151A32E5812   466  488b642420     REX.W movq rsp,[rsp+0x20]
                      ;;; <@142,#126> lazy-bailout
                      ;;; <@144,#202> number-tag-d
    00000151A32E5817   471  498b9dc06f0400 REX.W movq rbx,[r13+0x46fc0]
    00000151A32E581E   478  488bc3         REX.W movq rax,rbx
    00000151A32E5821   481  4883c010       REX.W addq rax,0x10
    00000151A32E5825   485  493b85c86f0400 REX.W cmpq rax,[r13+0x46fc8]
    00000151A32E582C   492  0f878f030000   ja 1409  (00000151A32E5BC1)
    00000151A32E5832   498  498985c06f0400 REX.W movq [r13+0x46fc0],rax
    00000151A32E5839   505  48ffc3         REX.W incq rbx
    00000151A32E583C   508  4d8b5560       REX.W movq r10,[r13+0x60]
    00000151A32E5840   512  4c8953ff       REX.W movq [rbx-0x1],r10
    00000151A32E5844   516  c5fb114307     vmovsd [rbx+0x7],xmm0
                      ;;; <@146,#130> gap
    00000151A32E5849   521  488b45a0       REX.W movq rax,[rbp-0x60]
    00000151A32E584D   525  488b7db8       REX.W movq rdi,[rbp-0x48]
    00000151A32E5851   529  488b75c0       REX.W movq rsi,[rbp-0x40]
    00000151A32E5855   533  488b4dc8       REX.W movq rcx,[rbp-0x38]
    00000151A32E5859   537  488b55b0       REX.W movq rdx,[rbp-0x50]
    00000151A32E585D   541  4c8b4da8       REX.W movq r9,[rbp-0x58]
    00000151A32E5861   545  4c8b4598       REX.W movq r8,[rbp-0x68]
    00000151A32E5865   549  c5fb109578ffffff vmovsd xmm2,[rbp-0x88]
    

    In short the difference is that for the "plus" case Mod (%) operation is performed using highly specialized mod-by-power-of-2-i machine code but for the "minus" case it is done using mod-d which is floating point-based arithmetic implementation.

    Notice also that mod-by-power-of-2-i machine code does support negative values. It can be roughly re-written as something like this:

    if (rax < 0) {
        rax = -rax;
        rax = rax & 1;
        rax = -rax;
    }
    else {
        rax = rax & 1;
    }
    

    So this is not a case of optimized machine code only for positive values.

    2. The difference in generated code seem to come from the fact that type inference works differently. Logs produced by --trace_representation show the following difference between "plus" and "minus" cases for a simplified program:

    var f_minus = function(log) {
        var str = '', i = gridSize, j;
        while (i--) {
          j = gridSize;
          while (j--) {
            var ttt = (i - j) % 2
          }
        }
    
      if(log) {
         if(ttt == -1)
            console.log(t);
       }
    }
    
    
    var f_plus = function(log) {
        var str = '', i = gridSize, j;
        while (i--) {
          j = gridSize;
          while (j--) {
            var ttt = (i + j) % 2
          }
        }
    
        if(log){
         if(ttt == -1)
            console.log(t);
        }
    }
    

    Compare

    [marking 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
    [compiling method 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> using Crankshaft OSR]
    #37 Phi is used by real #110 Branch as v
    #38 Phi is used by real #58 Add as t
    #38 Phi is used by real #69 StackCheck as v
    #38 Phi is used by real #70 LoadContextSlot as v
    #38 Phi is used by real #122 CompareGeneric as t
    #38 Phi is used by real #132 LoadGlobalGeneric as v
    #38 Phi is used by real #134 LoadNamedGeneric as v
    #38 Phi is used by real #136 LoadGlobalGeneric as v
    #38 Phi is used by real #141 CallWithDescriptor as v
    #38 Phi is used by real #156 Return as v
    #38 Phi is used by real #101 Mod as t
    #38 Phi is used by real #98 Sub as t
    #38 Phi is used by real #95 StackCheck as v
    #38 Phi is used by real #84 Add as t
    #47 Phi is used by real #56 ForceRepresentation as s
    #49 Phi is used by real #122 CompareGeneric as t
    #77 Phi is used by real #83 ForceRepresentation as s
    generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
    generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
    generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
    generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
    generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
    Changing #101 Mod representation v -> i based on inputs
    Changing #101 Mod representation i -> d based on output
    Changing #98 Sub representation v -> s based on inputs
    Changing #98 Sub representation s -> i based on use requirements
    Changing #84 Add representation v -> i based on inputs
    ...
    

    to this

    [marking 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
    [compiling method 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> using Crankshaft OSR]
    #37 Phi is used by real #110 Branch as v
    #38 Phi is used by real #58 Add as t
    #38 Phi is used by real #69 StackCheck as v
    #38 Phi is used by real #70 LoadContextSlot as v
    #38 Phi is used by real #122 CompareGeneric as t
    #38 Phi is used by real #132 LoadGlobalGeneric as v
    #38 Phi is used by real #134 LoadNamedGeneric as v
    #38 Phi is used by real #136 LoadGlobalGeneric as v
    #38 Phi is used by real #141 CallWithDescriptor as v
    #38 Phi is used by real #156 Return as v
    #38 Phi is used by real #101 Mod as t
    #38 Phi is used by real #98 Add as t
    #38 Phi is used by real #95 StackCheck as v
    #38 Phi is used by real #84 Add as t
    #47 Phi is used by real #56 ForceRepresentation as s
    #49 Phi is used by real #122 CompareGeneric as t
    #77 Phi is used by real #83 ForceRepresentation as s
    generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
    generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
    generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
    generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
    generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
    Changing #101 Mod representation v -> i based on inputs
    Changing #98 Add representation v -> s based on inputs
    Changing #98 Add representation s -> i based on use requirements
    Changing #84 Add representation v -> i based on inputs
    ...
    

    The interesting difference is the line

    Changing #101 Mod representation i -> d based on output
    

    that is only present in the f_minus but not the f_plus case. For some reason the compiler believes that in the f_minus case type should be Double instead of Integer basing the guess on the output value. Interestingly, if I change the line

            var ttt = (i - j) % 2
    

    to

            var ttt = (i - j + gridSize) % 2; 
    

    it again starts generating fast mod-by-power-of-2-i code. So yes, it looks probable that the output value affects optimizing compiler. But it is not clear why this happens in this particular case.

    At the first glance this behavior looks like a bug in the optimizing compiler that fails to notice that the "minus" case can be handled by mod-by-power-of-2-i as well. I was not able to trace why this happens just glancing over the source code so further input is welcome.