python optimization bitwise-operators micro-optimization

Why are bitwise operators slower than multiplication/division/modulo?

It's a well known fact that multiplication, integer division, and modulo by powers of two can be rewritten more efficiently as bitwise operations:

>>> x = randint(50000, 100000)
>>> x << 2 == x * 4
True
>>> x >> 2 == x // 4
True
>>> x & 3 == x % 4
True

In compiled languages such as C/C++ and Java, tests have shown that bitwise operations are generally faster than arithmetic operations. (See here and here). However, when I test these in Python, I am getting contrary results:

In [1]: from random import randint
   ...: nums = [randint(0, 1000000) for _ in range(100000)]

In [2]: %timeit [i * 8 for i in nums]
7.73 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit [i << 3 for i in nums]
8.22 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit [i // 8 for i in nums]
7.05 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit [i >> 3 for i in nums]
7.55 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit [i % 8 for i in nums]
5.96 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit [i & 7 for i in nums]
8.29 ms ± 816 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, bitwise operations are slower than their arithmetic counterparts, especially for modulo. I repeated this test for another set of numbers, and got the same result. Is there a reason for this? These tests were in CPython 3.6.7, if that matters.

Solution

*, %, and / all have fast paths for single-"limb" integers. <<, >>, and & don't. They're going through the general-purpose arbitrary-precision code path.