Python Numpy : np.int32 "slower" than np.float64

I would like to understand a strange behavior of Python. Let us consider a matrix M with shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:

When I do it "naturally" (i.e. without specifying any typing), numpy selects the type np.int32 and the operation takes around 150s.
When I force the type to be np.float64 (using dtype=...), the same operation takes around 2s.

How can we explain this behavior? I was naively thinking that an int multiplication was cheaper than a float multiplication.

Thanks a lot for your help.

Solution

No, integer multiplies aren't cheaper, but more on that later. Most likely (I am 99% sure) numpy calls a BLAS routine under the hood, which can reach as much as 90% of peak CPU performance. There are no special provisions for int matrix multiplies, so most likely they are done in Python rather than in a machine-compiled version. (I am actually wrong on this; see below.)

With regard to int vs. float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles per instruction, and both have scalar (x87 for floating point) and vector (XMM) versions. On Sandy Bridge, PMUL*** (integer vector multiply) takes 5 cycles, and so do the MULP* instructions (floating-point multiplies). Sandy Bridge also has 256-bit SIMD vector ops (YMM), which give you 8 single-precision float ops per instruction; I am not sure if there is an int counterpart.

This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf

That said, instruction latencies don't explain a 75x speed difference. It is probably a combination of an optimized (and probably threaded) BLAS for floats and int32 being handled in Python rather than C/Fortran.
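To see the gap on your own machine, a minimal timing sketch along these lines may help. The helper name `time_matmul`, the seed, and the reduced 600 x 200 shape are my own choices for illustration (not from the question), picked so the run stays short; the ratio you observe will depend on which BLAS your numpy is linked against.

```python
import time
import numpy as np

def time_matmul(a, repeats=3):
    """Return the best wall-clock time (seconds) of a.T @ a over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.dot(a.T, a)
        best = min(best, time.perf_counter() - start)
    return best

rng = np.random.default_rng(0)
# Assumption: a much smaller matrix than the 6000 x 2000 one in the question.
M_int = rng.integers(-10, 10, size=(600, 200), dtype=np.int32)
M_float = M_int.astype(np.float64)

t_int = time_matmul(M_int)
t_float = time_matmul(M_float)
print(f"int32:   {t_int:.4f} s")
print(f"float64: {t_float:.4f} s")
```

Taking the best of several runs reduces noise from other processes; scale the shape up toward 6000 x 2000 to reproduce the original numbers.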

I profiled the following snippet:

>>> import numpy as np
>>> F = np.random.random((6000, 2000)) + 4  # float64 matrix
>>> I = F.astype(np.int32)                  # same values truncated to int32
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())

and this is what oprofile says:

CPU_CLK_UNHALT...|
  samples|      %|
------------------
  2076880 51.5705 multiarray.so
  1928787 47.8933 libblas.so.3.0

However, the libblas here is the unoptimized, serial Netlib BLAS. With a good BLAS implementation, that 47% will be much lower, especially if it is threaded.

Edit: It seems numpy does provide a compiled version of integer matrix multiply.
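If the integer path is too slow, one workaround consistent with the timings in the question is to do the product in float64 and cast back. The sketch below is only an illustration under an assumption: every entry of the result must stay below 2**53 in magnitude so that float64 represents it exactly. The shape and value range are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: small values, so every sum of products is far below 2**53.
M = rng.integers(-100, 100, size=(300, 100), dtype=np.int32)

# Integer path: BLAS has no integer routines, so numpy uses its own loop.
gram_int = np.dot(M.T, M)

# Float path: eligible for BLAS, then rounded and cast back to integers.
M_f = M.astype(np.float64)
gram_float = np.dot(M_f.T, M_f)
gram_back = np.rint(gram_float).astype(np.int64)

# Both paths should agree exactly under the assumption above.
print(np.array_equal(gram_int, gram_back))
```

For larger values, check the bound before trusting the float result, and note that the int32 product itself can overflow silently.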