javascript canvas html5-canvas typed-arrays

Why is TypedArray access faster when using a 32-bit buffer view?

A couple of days ago I was playing with pixel-by-pixel canvas manipulation, and I noticed a slight performance increase when writing to the pixel buffer through a 32-bit typed array view instead of an 8-bit one.

Example:


var canvas = document.querySelector("canvas");
var ctx = canvas.getContext("2d");

var image_data = ctx.getImageData(0, 0, canvas.width, canvas.height); // arguments are x, y, width, height
var image_buffer = new ArrayBuffer(image_data.data.length);
var image_buffer8 = new Uint8ClampedArray(image_buffer);
var image_buffer32 = new Uint32Array(image_buffer);
var pixel, color;

console.time("array-index");
for(pixel=0; pixel< image_buffer8.length; pixel += 4) {
    color = Math.random() * 255;
    image_buffer8[pixel]    = color;    // Red
    image_buffer8[pixel +1] = color;    // Green
    image_buffer8[pixel +2] = color;    // Blue
    image_buffer8[pixel +3] = 255;      // Alpha
}
console.timeEnd("array-index");

console.time("array-bitwise");
for(pixel = 0; pixel<image_buffer32.length; pixel++){
    color = Math.random() * 255;
    image_buffer32[pixel] = ( 255 << 24 | color << 16 | color << 8 | color );
}
console.timeEnd("array-bitwise");

The output is:

array-index: 4.273ms
array-bitwise: 3.743ms

The question is:

Why is writing to the array through a 32-bit view faster, even though the assignment contains bitwise operators? As I see it, bitwise arithmetic should also cost CPU time.

I am interested in the following aspects:

From the hardware/JS point of view, why is the 32-bit assignment faster?
How does the number of bitwise operators inside the assignment affect performance?
Can I increase assignment performance even more? Is it possible to use 64-bit chunks or bigger?
Can I convert this code to benefit from asm.js and increase performance even more?

Solution

The 8-bit assignments are much more expensive than the bitwise operations. You have to look at this kind of thing in terms of how modern CPUs are architected: internally, all the data paths are at least 32 bits wide. Moving data from one place to another (in this case, a calculated result) costs the same regardless of width: moving 8 bits takes as much CPU resources as moving 32 bits. In the 8-bit case you do that movement four times per pixel, so even if the data only travels from the CPU's calculating unit to the level 1 cache, it is still about four times more expensive than a single 32-bit move. The shifts and ORs, by contrast, operate on values already held in registers, so they add little.
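To make the difference in store count concrete, here is a minimal sketch. The helper names `fillBytes` and `fillWords` are made up for illustration and are not part of the question's code. It fills two 4-pixel buffers, one through an 8-bit view and one through a 32-bit view, and counts the array writes each approach needs. The packing is the same as in the question, which yields R, G, B, A byte order on little-endian machines.

```javascript
// Hypothetical helpers for illustration only.

// Writes one gray pixel per iteration using four separate 8-bit stores.
function fillBytes(view8, gray) {
    let stores = 0;
    for (let i = 0; i < view8.length; i += 4) {
        view8[i]     = gray; // Red
        view8[i + 1] = gray; // Green
        view8[i + 2] = gray; // Blue
        view8[i + 3] = 255;  // Alpha
        stores += 4;
    }
    return stores;
}

// Writes one gray pixel per iteration using a single 32-bit store.
function fillWords(view32, gray) {
    // The shifts and ORs run on values in registers; >>> 0 makes the result unsigned.
    const packed = (255 << 24 | gray << 16 | gray << 8 | gray) >>> 0;
    let stores = 0;
    for (let i = 0; i < view32.length; i++) {
        view32[i] = packed;
        stores += 1;
    }
    return stores;
}

const pixelsA = new Uint8ClampedArray(16); // 4 pixels, 4 bytes each
const pixelsB = new Uint8ClampedArray(16);

const byteStores = fillBytes(pixelsA, 128);
const wordStores = fillWords(new Uint32Array(pixelsB.buffer), 128);

console.log(byteStores, wordStores); // 16 4
```

Both buffers describe the same four gray pixels on a little-endian machine, but the 32-bit version issues a quarter of the writes.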

In statically typed languages with modern compilers, such as C, the compiler could possibly optimize this kind of code automatically, using a "SIMD" (Single Instruction, Multiple Data) machine instruction to pack the four 8-bit assignments into a single 32-bit assignment internally, although that is not likely. It is much harder to do in a dynamic language such as JavaScript, even when the code runs in a JIT-compiled environment (just-in-time optimization to native code).
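Since the engine is unlikely to merge the byte stores for you, the packing has to be done by hand, and that makes the code depend on the machine's byte order. The sketch below is one way to handle it, not the only one. The helper `packRGBA` and the constant `LITTLE_ENDIAN` are hypothetical names introduced here. It assumes each channel is an integer in the range 0 to 255.

```javascript
// Detect byte order once: store a known 32-bit value and look at its first byte in memory.
const LITTLE_ENDIAN =
    new Uint8Array(new Uint32Array([0x0a0b0c0d]).buffer)[0] === 0x0d;

// Hypothetical helper: pack one pixel so its bytes land in memory as R, G, B, A.
// Assumes r, g, b, a are integers in 0..255.
function packRGBA(r, g, b, a) {
    return LITTLE_ENDIAN
        ? ((a << 24) | (b << 16) | (g << 8) | r) >>> 0
        : ((r << 24) | (g << 16) | (b << 8) | a) >>> 0;
}

const bytes = new Uint8ClampedArray(8);      // two pixels
const words = new Uint32Array(bytes.buffer); // 32-bit view over the same memory

words[0] = packRGBA(10, 20, 30, 255);
words[1] = packRGBA(200, 100, 50, 255);

console.log(Array.from(bytes)); // [10, 20, 30, 255, 200, 100, 50, 255]
```

The question's gray-scale packing happens to work without this check on little-endian machines because red, green and blue are equal. With distinct channel values, or on a big-endian machine, the order of the shifts matters.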