I was going through Eloquent JavaScript (again) and came across exercise "Chess Board" of Chapter 2. I had my one decent version of solution written back in the day when I was first time reading it, and another version of solution provided at the Elequent Javascript website. I'm one of the newbies that wanna be super-efficient programmers with only one question in their head: "Can I make it work a tad faster or smaller in anyway?"
So, during my search on the web few months ago, I came across a question on Stack Overflow, regarding for
loop vs while
loop on basis of performance. Since in that thread it was mentioned for
loops are slower than while
and loops with decrementing iterator are faster so I rewrote the code for better performance.
Here's the new version with for
replaced with while
and conditions edited for decrementing:
console.time("looping");
var gridSize = 5000, str = '', i = gridSize, j;
while (i--) {
j = gridSize;
while (j--) {
if ((i - j) % 2 === 0)
str += " ";
else
str += "#";
}
str += "\n";
}
//console.log(str);
console.timeEnd("looping");
But to my surprise this code had even worse performance. Then, after a while I found out that if ((i - j) % 2 === 0)
was the main culprit, replacing minus sign with plus reduced the total time of execution to ~ 750ms
//Checked on NODE.js = v6.11.2
Book version of code --> 893.76 ms
while loop with subtraction --> 1562.43 ms
while loop with addition --> 749.62 ms
//firefox Benchmark v54.0 (64-bit) (OS Ubuntu 16.04)
Book version of code --> 725.10 ms
while loop with subtraction --> 1565.29 ms
while loop with addition --> 601.12 ms
Why is subtraction having such huge impact on total execution time?
After looking at @jaromandaX answer I'm Pretty sure that it is not the subtracting thats slowing down this loop, its the modulo of negative Number. Again I wanna know what makes modulo of negative number slower
This is far from a full answer and requires further investigation (or insights from someone who knows details of V8 implementation). Still, here are my findings:
Sidenode: results were gathered running Node.JS using following command line:
node --expose-gc --print-code --code-comments --print-opt-code --trace-hydrogen --redirect-code-traces --redirect-code-traces-to=code.asm --trace_representation --trace-deopt --trace-opt 1.js
and a bit of looking into the V8 source code.
1. Performance difference comes from the fact that different machine code is generated in those case. For +
the code for %
is
;;; <@134,#123> add-i
00000151A32DD74B 395 03c2 addl rax,rdx
00000151A32DD74D 397 0f807a030000 jo 1293 (00000151A32DDACD)
;;; <@136,#126> mod-by-power-of-2-i
00000151A32DD753 403 85c0 testl rax,rax
00000151A32DD755 405 790f jns 422 (00000151A32DD766)
00000151A32DD757 407 f7d8 negl rax
00000151A32DD759 409 83e001 andl rax,0x1
00000151A32DD75C 412 f7d8 negl rax
00000151A32DD75E 414 0f846e030000 jz 1298 (00000151A32DDAD2)
00000151A32DD764 420 eb03 jmp 425 (00000151A32DD769)
00000151A32DD766 422 83e001 andl rax,0x1
;;; <@138,#200> smi-tag
00000151A32DD769 425 8bd8 movl rbx,rax
00000151A32DD76B 427 48c1e320 REX.W shlq rbx, 32
;;; <@140,#130> gap
00000151A32DD76F 431 488bc2 REX.W movq rax,rdx
while for -
the code is much more complicated:
;;; <@136,#123> sub-i
00000151A32E57E1 417 412bc3 subl rax,r11
00000151A32E57E4 420 0f8039040000 jo 1507 (00000151A32E5C23)
;;; <@138,#200> int32-to-double
00000151A32E57EA 426 c5f957c0 vxorpd xmm0,xmm0,xmm0
00000151A32E57EE 430 c5fb2ac0 vcvtlsi2sd xmm0,xmm0,rax
;;; <@139,#200> gap
00000151A32E57F2 434 c5f928ca vmovapd xmm1,xmm2
;;; <@140,#126> mod-d
00000151A32E57F6 438 4989e2 REX.W movq r10,rsp
00000151A32E57F9 441 4883ec28 REX.W subq rsp,0x28
00000151A32E57FD 445 4883e4f0 REX.W andq rsp,0xf0
00000151A32E5801 449 4c89542420 REX.W movq [rsp+0x20],r10
00000151A32E5806 454 48b830d4124001000000 REX.W movq rax,000000014012D430
00000151A32E5810 464 ffd0 call rax
00000151A32E5812 466 488b642420 REX.W movq rsp,[rsp+0x20]
;;; <@142,#126> lazy-bailout
;;; <@144,#202> number-tag-d
00000151A32E5817 471 498b9dc06f0400 REX.W movq rbx,[r13+0x46fc0]
00000151A32E581E 478 488bc3 REX.W movq rax,rbx
00000151A32E5821 481 4883c010 REX.W addq rax,0x10
00000151A32E5825 485 493b85c86f0400 REX.W cmpq rax,[r13+0x46fc8]
00000151A32E582C 492 0f878f030000 ja 1409 (00000151A32E5BC1)
00000151A32E5832 498 498985c06f0400 REX.W movq [r13+0x46fc0],rax
00000151A32E5839 505 48ffc3 REX.W incq rbx
00000151A32E583C 508 4d8b5560 REX.W movq r10,[r13+0x60]
00000151A32E5840 512 4c8953ff REX.W movq [rbx-0x1],r10
00000151A32E5844 516 c5fb114307 vmovsd [rbx+0x7],xmm0
;;; <@146,#130> gap
00000151A32E5849 521 488b45a0 REX.W movq rax,[rbp-0x60]
00000151A32E584D 525 488b7db8 REX.W movq rdi,[rbp-0x48]
00000151A32E5851 529 488b75c0 REX.W movq rsi,[rbp-0x40]
00000151A32E5855 533 488b4dc8 REX.W movq rcx,[rbp-0x38]
00000151A32E5859 537 488b55b0 REX.W movq rdx,[rbp-0x50]
00000151A32E585D 541 4c8b4da8 REX.W movq r9,[rbp-0x58]
00000151A32E5861 545 4c8b4598 REX.W movq r8,[rbp-0x68]
00000151A32E5865 549 c5fb109578ffffff vmovsd xmm2,[rbp-0x88]
In short the difference is that for the "plus" case Mod (%
) operation is performed using highly specialized mod-by-power-of-2-i
machine code but for the "minus" case it is done using mod-d
which is floating point-based arithmetic implementation.
Notice also that mod-by-power-of-2-i
machine code does support negative values. It can be roughly re-written as something like this:
if (rax < 0) {
rax = -rax;
rax = rax & 1;
rax = -rax;
}
else {
rax = rax & 1;
}
So this is not a case of optimized machine code only for positive values.
2. The difference in generated code seem to come from the fact that type inference works differently. Logs produced by --trace_representation
show the following difference between "plus" and "minus" cases for a simplified program:
var f_minus = function(log) {
var str = '', i = gridSize, j;
while (i--) {
j = gridSize;
while (j--) {
var ttt = (i - j) % 2
}
}
if(log) {
if(ttt == -1)
console.log(t);
}
}
var f_plus = function(log) {
var str = '', i = gridSize, j;
while (i--) {
j = gridSize;
while (j--) {
var ttt = (i + j) % 2
}
}
if(log){
if(ttt == -1)
console.log(t);
}
}
Compare
[marking 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Sub as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #101 Mod representation i -> d based on output
Changing #98 Sub representation v -> s based on inputs
Changing #98 Sub representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...
to this
[marking 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Add as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #98 Add representation v -> s based on inputs
Changing #98 Add representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...
The interesting difference is the line
Changing #101 Mod representation i -> d based on output
that is only present in the f_minus
but not the f_plus
case. For some reason the compiler believes that in the f_minus
case type should be Double instead of Integer basing the guess on the output value. Interestingly, if I change the line
var ttt = (i - j) % 2
to
var ttt = (i - j + gridSize) % 2;
it again starts generating fast mod-by-power-of-2-i
code. So yes, it looks probable that the output value affects optimizing compiler. But it is not clear why this happens in this particular case.
At the first glance this behavior looks like a bug in the optimizing compiler that fails to notice that the "minus" case can be handled by mod-by-power-of-2-i
as well. I was not able to trace why this happens just glancing over the source code so further input is welcome.