I have three implementations of a function that checks whether a string (or a space-delimited phrase) is a palindrome:
def palindrome(str_in):
    def p(s, i, j):
        if i >= j:
            return True
        elif s[i] != s[j]:
            return False
        else:
            return p(s, i+1, j-1)
    st = str_in.replace(' ', '').lower()
    return p(st, 0, len(st)-1)  # index off the cleaned string, not str_in
def palindrome1(s):
    st = s.replace(' ', '').lower()
    return st == st[::-1]
def palindrome2(s):
    st = s.replace(' ', '').lower()
    i, j = 0, len(st)-1
    while i < j:
        if st[i] != st[j]:
            return False
        else:
            i += 1
            j -= 1
    return True
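As a quick sanity check that the three implementations agree (the sample phrases here are just examples of mine):

for phrase in ('never odd or even', 'nurses run', 'clearly not one'):
    assert palindrome(phrase) == palindrome1(phrase) == palindrome2(phrase)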
Now, I figured palindrome() would be optimal in theory, because no reversing and no extra memory is involved, but Python does not have tail call optimization. palindrome2() is the imperative version of palindrome(), yet it still takes much longer than palindrome1(). Why is this?
Here are the profiling results (run with python -m cProfile file.py):
12 function calls in 45.341 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.232 0.232 45.341 45.341 file.py:1(<module>)
1 2.198 2.198 3.532 3.532 file.py:300(palindrome1)
1 39.442 39.442 40.734 40.734 file.py:304(palindrome2)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 2.396 1.198 2.396 1.198 {method 'lower' of 'str' objects}
1 0.843 0.843 0.843 0.843 {method 'read' of 'file' objects}
2 0.231 0.115 0.231 0.115 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
1 0.000 0.000 0.000 0.000 {sys.setrecursionlimit}
Here are the profiling results (run with pypy -m cProfile hw2.py):
11 function calls in 12.470 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 12.470 12.470 hw2.py:1(<module>)
1 2.594 2.594 6.280 6.280 hw2.py:303(palindrome1)
1 0.852 0.852 4.347 4.347 hw2.py:307(palindrome2)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 3.263 1.631 3.263 1.631 {method 'lower' of 'str' objects}
1 1.832 1.832 1.832 1.832 {method 'read' of 'file' objects}
2 3.918 1.959 3.918 1.959 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {sys.setrecursionlimit}
Here is my palindrome constructor:
def palindrome_maker(n):
    from random import choice
    alphabet = ' abcdefghijklmnopqrstuvwxyz'
    front = ''.join([choice(alphabet) for _ in range(n//2)])
    back = front[::-1]
    return front + (choice(alphabet) if n % 2 == 1 else '') + back
BTW: the profile shows the performance of calling the functions with a string of length 999999999.
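For anyone who wants to reproduce this without the profiler, here is a rough timeit sketch (with a much shorter string than the 999999999-character run above, so it finishes quickly):

import timeit

s = palindrome_maker(10**6)
# palindrome() is left out here: at this length it needs
# sys.setrecursionlimit raised and may overflow the C stack anyway.
for f in (palindrome1, palindrome2):
    print(f.__name__, timeit.timeit(lambda: f(s), number=3))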
OK, so let's start from the beginning. CPython compiles the source text into a thing called bytecode, a representation that is easier for the virtual machine (i.e. the interpreter) to work with.
Both the palindrome and palindrome2 functions are slower than palindrome1 because of the overhead of interpreting that bytecode. There's a neat module in CPython called dis. If you use it on a compiled function, it will show the function's internal representation. So let's do that:
>>> dis.dis(palindrome)
2 0 LOAD_CLOSURE 0 (p)
3 BUILD_TUPLE 1
6 LOAD_CONST 1 (<code object p at 0x01B95110, file "<stdin>", line 2>)
9 LOAD_CONST 2 ('palindrome.<locals>.p')
12 MAKE_CLOSURE 0
15 STORE_DEREF 0 (p)
9 18 LOAD_DEREF 0 (p)
21 LOAD_FAST 0 (str_in)
24 LOAD_ATTR 0 (replace)
27 LOAD_CONST 3 (' ')
30 LOAD_CONST 4 ('')
33 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
36 LOAD_ATTR 1 (lower)
39 CALL_FUNCTION 0 (0 positional, 0 keyword pair)
42 LOAD_CONST 5 (0)
45 LOAD_GLOBAL 2 (len)
48 LOAD_FAST 0 (str_in)
51 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
54 LOAD_CONST 6 (1)
57 BINARY_SUBTRACT
58 CALL_FUNCTION 3 (3 positional, 0 keyword pair)
61 RETURN_VALUE
Now let's compare this with the palindrome1 function:
>>> dis.dis(palindrome1)
2 0 LOAD_FAST 0 (s)
3 LOAD_ATTR 0 (replace)
6 LOAD_CONST 1 (' ')
9 LOAD_CONST 2 ('')
12 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
15 LOAD_ATTR 1 (lower)
18 CALL_FUNCTION 0 (0 positional, 0 keyword pair)
21 STORE_FAST 1 (st)
3 24 LOAD_FAST 1 (st)
27 LOAD_FAST 1 (st)
30 LOAD_CONST 0 (None)
33 LOAD_CONST 0 (None)
36 LOAD_CONST 4 (-1)
39 BUILD_SLICE 3
42 BINARY_SUBSCR
43 COMPARE_OP 2 (==)
46 RETURN_VALUE
So this is more or less what CPython sees (actually the instructions are encoded in a binary form, but that's irrelevant at the moment). The virtual machine then goes through them and executes them one by one.
So the first obvious thing is: more instructions == more time to execute. Each instruction has to be interpreted, and the appropriate C code has to run for it. And a lot more instructions get executed in the other two functions than in palindrome1, because of the loop and the recursive calls. So essentially it's like you're trying to run a few laps, but Python says "no, no, no, you have to run with 20 kg on your shoulders". The more laps there are (i.e. the more bytecode to execute), the slower you get. Generally this performance degradation should be linear in CPython, but really, who knows without reading through CPython's code? I've heard that a technique called inline caching was supposed to be implemented in CPython, which would affect performance a lot; I don't know whether it was ever done.
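You can see that per-instruction weight directly by comparing a hand-written loop against one builtin call doing the same work. This is a toy illustration of mine, not code from the question:

import timeit

data = list(range(10**6))

def manual_sum(xs):
    # Every iteration executes several bytecode instructions.
    total = 0
    for x in xs:
        total += x
    return total

print(timeit.timeit(lambda: manual_sum(data), number=10))  # interpreted loop
print(timeit.timeit(lambda: sum(data), number=10))         # one C-level call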
The other thing is that calls in Python are expensive. There is an ABI for how calls should be done at the low level (i.e. push arguments onto the stack and jump), and C/C++ follows it. Python does a lot more than that: frame objects are created (which can be inspected, e.g. when an exception happens), there's a max-recursion check, etc. All of that counts toward the performance loss.
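The call overhead is easy to measure on its own; here's a minimal sketch (the exact numbers will vary, but the function-call version is consistently slower):

import timeit

print(timeit.timeit('x + 1', setup='x = 0'))
print(timeit.timeit('f(x)', setup='x = 0\ndef f(v): return v + 1'))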
So the palindrome function makes a lot of calls, and recursion is inefficient in Python. In particular, this is the reason why palindrome2 is faster than palindrome.
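This is also why the profiled script calls sys.setrecursionlimit: p recurses once per character pair, so the default limit of roughly 1000 is nowhere near enough for long inputs. A sketch of what that looks like, with a shorter string of my own choosing:

import sys

s = palindrome_maker(10**4)
# One recursive call per character pair. Note that raising the limit
# too far can still overflow the C stack and crash the interpreter.
sys.setrecursionlimit(len(s) // 2 + 100)
print(palindrome(s))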
The other thing is that palindrome1 has [::-1], which translates into a BUILD_SLICE instruction followed by a subscript, and the slicing itself is implemented in C. So even though it does more work than necessary (there is no reason to create another copy of the string), it is still faster than the other functions, simply because the intermediate layer (i.e. the bytecode) is minimal. There is no need for the interpreter to waste time interpreting bytecode character by character.
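You can see how cheap the C-level slice is next to the same reversal driven by interpreted bytecode. Again, a toy comparison of mine:

import timeit

st = 'ab' * 100000

def py_reverse(s):
    # Same result as s[::-1], but a batch of interpreted
    # instructions per character.
    out = []
    for i in range(len(s) - 1, -1, -1):
        out.append(s[i])
    return ''.join(out)

print(timeit.timeit(lambda: st[::-1], number=10))        # slice: runs in C
print(timeit.timeit(lambda: py_reverse(st), number=10))  # interpreted loop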
Another important thing is that each object you create in Python has to be garbage collected, and since these objects are generally bigger than plain C objects (for example due to the reference counter), this takes more time. Incrementing and decrementing reference counters costs time as well, by the way. There's also this thing called the GIL (Global Interpreter Lock), a lock that is acquired and released as bytecode executes so that the bytecode stays thread-safe, even though it is completely unnecessary for a single-threaded application. But Python doesn't know that you won't start threads at some point, so it has to do this every time. This is all so that you don't have to worry about hard problems that most C/C++ coders have to deal with. :)
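A quick way to see that per-object overhead, assuming a 64-bit CPython 3 (exact numbers vary by version and platform):

import sys

print(sys.getsizeof(1))    # typically 28 bytes; a C int is 4 or 8
print(sys.getrefcount(1))  # small ints are shared, so this count is large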
Now, PyPy is another story. It has this neat thing inside it called a JIT (Just-In-Time compiler). It takes Python bytecode and converts it into machine code on the fly, which is then reused. So the initial call to a function carries the compilation overhead, but everything is still faster overall: ultimately there is no bytecode being interpreted and the functions run directly on the CPU. However, this doesn't mean that PyPy is as fast as a function written in C (e.g. [::-1]), simply because lots of optimizations are done at the C level that we don't know how to implement in PyPy or any other Python interpreter. This is due to the nature of the language: it is dynamic. Whether it is truly impossible is another story; it's not obvious at all, but at the moment we just don't know how to do it.
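You can watch the JIT warm up by timing repeated calls to the same function. A rough sketch, assuming the definitions above are in scope; under PyPy the first run is typically the slowest, while under CPython all runs cost about the same:

import time

def bench(func, arg, repeats=5):
    for i in range(repeats):
        start = time.time()
        func(arg)
        # On PyPy the early runs include JIT compilation time.
        print('run %d: %.3fs' % (i, time.time() - start))

bench(palindrome2, palindrome_maker(10**6))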
tl;dr: built-in functions (or, more generally, C code run from Python) are always at least as fast as equivalent pure Python code, and in most cases a lot faster.