python performance optimization binary binary-search

(AoC Day 5) Decoding the string using binary search is as fast as binary conversion?

This question is regarding the solution to the (first half of) Day 5 problem for Advent of Code 2020.

I have written two different functions to get to the same result, i.e. decoding the boarding pass string into a row, column coordinate. In the first case I went through a binary search based upon each character in the string:

def decode(bp):
  row = bp[:7]
  col = bp[-3:]
  row_lower, row_upper = 0, 127
  col_lower, col_upper = 0, 7
  for char in row:
      if char=='F':
          row_upper = ((row_upper+row_lower))//2
      else:
          row_lower = ((row_upper+row_lower)+1)//2
  for char in col:
      if char=='L':
          col_upper = ((col_upper+col_lower))//2
      else:
          col_lower = ((col_upper+col_lower)+1)//2
  sid = (row_lower*8)+col_lower
  return (row_lower, col_lower, sid)

I then realized that if we consider the string as a binary number, there is a 1:1 map to each row, column coordinate, and so I also wrote this alternative solution to the problem:

def alternative_decode(bp):
  bp = bp.replace('F', '0').replace('L', '0').replace('B', '1').replace('R', '1')
  return(int(bp[:7], 2), int(bp[-3:], 2), (int(bp[:7], 2)*8)+int(bp[-3:], 2))

I wrote the second solution because I expected it to be considerably faster than the first, given that it's a simple binary conversion rather than a binary search. However, when timing the two approaches, I noticed that both have the same running time, i.e roughly between 0.0000020 and 0.0000025 seconds.

What is the cause of this? Is there some Python magic happening behind the scenes that makes both solutions equally efficient, or did I write them in such a way that makes them equally unefficient?

Solution

On my computer, your alternative becomes significantly faster if you reuse the row/col result, rather than repeating the int() calls for the sid return:

def alternative_decode_reuse(bp):
  bp = bp.replace('F', '0').replace('L', '0').replace('B', '1').replace('R', '1')
  row = int(bp[:7], 2)
  col = int(bp[-3:], 2)
  sid = row*8 + col
  return (row, col, sid)

After putting all three functions in a file decoding.py, this gives us:

tmp$ python -m timeit -s 'import decoding' -- 'decoding.decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.75 usec per loop
tmp$ python -m timeit -s 'import decoding' -- 'decoding.alternative_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.62 usec per loop
tmp$ python -m timeit -s 'import decoding' -- 'decoding.alternative_decode_reuse("FFBBFBFLLR")'
1000000 loops, best of 3: 1.32 usec per loop

At this point, most of the time is spent doing the replace() line. So what if we didn't?

def third_decode(bp):
    row = 0
    for c in bp[:7]:
        row <<= 1
        if c == 'B':
            row += 1
    col = 0
    for c in bp[7:]:
        col <<= 1
        if c == 'R':
            col += 1
    sid = row * 8 + col
    return (row, col, sid)

This gives:

tmp$ python -m timeit -s 'import decoding' -- 'decoding.third_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.37 usec per loop

Slightly worse, or at least not clearly better. What if we also use the fact that the desired sid is equivalent to a (binary) concatenation of the row/col numbers?

def fourth_decode(bp):
    sid = 0
    for c in bp:
        sid <<= 1
        if c in 'BR':
            sid += 1
    row = sid >> 3
    col = sid & 7
    return (row, col, sid)

Yes, that helps a little:

tmp$ python -m timeit -s 'import decoding' -- 'decoding.fourth_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.16 usec per loop

At this point, I'm tired of editing cmdline args to re-run everything, so let's add this to the bottom of decoding.py:

if __name__ == '__main__':
    loops = 1000000
    funcs = (
            decode, alternative_decode, alternative_decode_reuse, replace_only,
            third_decode, fourth_decode,
    )
    from timeit import Timer
    for fun in funcs:
        cmd = 'decoding.%s("FFBBFBFLLR")' % fun.__name__
        timer = Timer(cmd, setup='import decoding')
        totaltime = min(timer.repeat(5, loops))
        fmt = '%25s returned %14r -- %8d loops, best of 5: %6d ns per loop'
        arg = (fun.__name__, fun('FFBBFBFLLR'), loops, totaltime*1000000000/loops)
        print(fmt % arg)

This lets us run all the functions without hassle:

tmp$ python decoding.py
                   decode returned   (26, 1, 209) --  1000000 loops, best of 5:   2090 ns per loop
       alternative_decode returned   (26, 1, 209) --  1000000 loops, best of 5:   1829 ns per loop
 alternative_decode_reuse returned   (26, 1, 209) --  1000000 loops, best of 5:   1414 ns per loop
             replace_only returned           None --  1000000 loops, best of 5:    700 ns per loop
             third_decode returned   (26, 1, 209) --  1000000 loops, best of 5:   1368 ns per loop
            fourth_decode returned   (26, 1, 209) --  1000000 loops, best of 5:   1123 ns per loop

After that, I'm out of ideas of how to make it go faster. But last year, an Advent of Code-solving acquaintance told me he's using the pypy Python implementation for its speed. Maybe it can help?

tmp$ pypy decoding.py
                   decode returned   (26, 1, 209) --  1000000 loops, best of 5:    151 ns per loop
       alternative_decode returned   (26, 1, 209) --  1000000 loops, best of 5:      4 ns per loop
 alternative_decode_reuse returned   (26, 1, 209) --  1000000 loops, best of 5:      3 ns per loop
             replace_only returned           None --  1000000 loops, best of 5:      3 ns per loop
             third_decode returned   (26, 1, 209) --  1000000 loops, best of 5:    141 ns per loop
            fourth_decode returned   (26, 1, 209) --  1000000 loops, best of 5:    138 ns per loop

Well, all my effort for nothing! :)

It looks like pypy's replace() and int() functions are much faster. Also, while its JIT does make our various loopy functions faster, it's still preferable to use the builtin functions when possible.