I have three implementations of a function that checks whether a string (or a space-delimited phrase) is a palindrome:
def palindrome(str_in):
    def p(s, i, j):
        if i >= j:
            return True
        elif s[i] != s[j]:
            return False
        else:
            return p(s, i+1, j-1)
    st = str_in.replace(' ', '').lower()
    return p(st, 0, len(st)-1)  # index off the cleaned string, not str_in
def palindrome1(s):
    st = s.replace(' ', '').lower()
    return st == st[::-1]
def palindrome2(s):
    st = s.replace(' ', '').lower()
    i, j = 0, len(st)-1
    while i < j:
        if st[i] != st[j]:
            return False
        else:
            i += 1
            j -= 1
    return True
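As a quick sanity check that the three implementations agree (the sample phrases here are just examples of mine):

for phrase in ('never odd or even', 'nurses run', 'clearly not one'):
    assert palindrome(phrase) == palindrome1(phrase) == palindrome2(phrase)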
Now, I figured palindrome() would be optimal in theory, because no reversing and no extra memory is involved, but Python does not have tail call optimization. palindrome2() is the imperative version of palindrome(), yet it still takes much longer than palindrome1(). Why is this?
Here are the profiling results (run with python -m cProfile file.py):
12 function calls in 45.341 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.232 0.232 45.341 45.341 file.py:1(<module>)
1 2.198 2.198 3.532 3.532 file.py:300(palindrome1)
1 39.442 39.442 40.734 40.734 file.py:304(palindrome2)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 2.396 1.198 2.396 1.198 {method 'lower' of 'str' objects}
1 0.843 0.843 0.843 0.843 {method 'read' of 'file' objects}
2 0.231 0.115 0.231 0.115 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
1 0.000 0.000 0.000 0.000 {sys.setrecursionlimit}
Here are the profiling results (run with pypy -m cProfile hw2.py):
11 function calls in 12.470 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 12.470 12.470 hw2.py:1(<module>)
1 2.594 2.594 6.280 6.280 hw2.py:303(palindrome1)
1 0.852 0.852 4.347 4.347 hw2.py:307(palindrome2)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 3.263 1.631 3.263 1.631 {method 'lower' of 'str' objects}
1 1.832 1.832 1.832 1.832 {method 'read' of 'file' objects}
2 3.918 1.959 3.918 1.959 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {sys.setrecursionlimit}
Here is my palindrome constructor:
def palindrome_maker(n):
    from random import choice
    alphabet = ' abcdefghijklmnopqrstuvwxyz'
    front = ''.join([choice(alphabet) for _ in range(n//2)])
    back = front[::-1]
    return front + (choice(alphabet) if n % 2 == 1 else '') + back
BTW: the profile shows the performance of calling the functions with a string of length 999999999.
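For anyone who wants to reproduce this without the profiler, here is a rough timeit sketch (with a much shorter string than the 999999999-character run above, so it finishes quickly):

import timeit

s = palindrome_maker(10**6)
# palindrome() is left out here: at this length it needs
# sys.setrecursionlimit raised and may overflow the C stack anyway.
for f in (palindrome1, palindrome2):
    print(f.__name__, timeit.timeit(lambda: f(s), number=3))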
OK, so let's start from the beginning. CPython compiles the source text into a thing called bytecode, a representation that is easier for the virtual machine (i.e. the interpreter) to work with.
Both the palindrome and palindrome2 functions are slower than palindrome1 because of the overhead of interpreting that bytecode. There's a neat module in CPython called dis. If you use it on a compiled function, it will show the function's internal representation. So let's do that:
>>> dis.dis(palindrome)
2 0 LOAD_CLOSURE 0 (p)
3 BUILD_TUPLE 1
6 LOAD_CONST 1 (<code object p at 0x01B95110, file "<stdin>", line 2>)
9 LOAD_CONST 2 ('palindrome.<locals>.p')
12 MAKE_CLOSURE 0
15 STORE_DEREF 0 (p)
9 18 LOAD_DEREF 0 (p)
21 LOAD_FAST 0 (str_in)
24 LOAD_ATTR 0 (replace)
27 LOAD_CONST 3 (' ')
30 LOAD_CONST 4 ('')
33 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
36 LOAD_ATTR 1 (lower)
39 CALL_FUNCTION 0 (0 positional, 0 keyword pair)
42 LOAD_CONST 5 (0)
45 LOAD_GLOBAL 2 (len)
48 LOAD_FAST 0 (str_in)
51 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
54 LOAD_CONST 6 (1)
57 BINARY_SUBTRACT
58 CALL_FUNCTION 3 (3 positional, 0 keyword pair)
61 RETURN_VALUE
Now let's compare this with the palindrome1 function:
>>> dis.dis(palindrome1)
2 0 LOAD_FAST 0 (s)
3 LOAD_ATTR 0 (replace)
6 LOAD_CONST 1 (' ')
9 LOAD_CONST 2 ('')
12 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
15 LOAD_ATTR 1 (lower)
18 CALL_FUNCTION 0 (0 positional, 0 keyword pair)
21 STORE_FAST 1 (st)
3 24 LOAD_FAST 1 (st)
27 LOAD_FAST 1 (st)
30 LOAD_CONST 0 (None)
33 LOAD_CONST 0 (None)
36 LOAD_CONST 4 (-1)
39 BUILD_SLICE 3
42 BINARY_SUBSCR
43 COMPARE_OP 2 (==)
46 RETURN_VALUE
So this is more or less what CPython sees (actually the instructions are encoded in a binary form, but that's irrelevant at the moment). The virtual machine then goes through them and executes them one by one.
So the first obvious thing is: more instructions == more time to execute. Each instruction has to be interpreted, and the appropriate C code has to run for it. And a lot more instructions get executed in the other two functions than in palindrome1, because of the loop and the recursive calls. So essentially it's like you're trying to run a few laps, but Python says "no, no, no, you have to run with 20 kg on your shoulders". The more laps there are (i.e. the more bytecode to execute), the slower you get. Generally this performance degradation should be linear in CPython, but really, who knows without reading through CPython's code? I've heard that a technique called inline caching was supposed to be implemented in CPython, which would affect performance a lot; I don't know whether it was ever done.
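You can see that per-instruction weight directly by comparing a hand-written loop against one builtin call doing the same work. This is a toy illustration of mine, not code from the question:

import timeit

data = list(range(10**6))

def manual_sum(xs):
    # Every iteration executes several bytecode instructions.
    total = 0
    for x in xs:
        total += x
    return total

print(timeit.timeit(lambda: manual_sum(data), number=10))  # interpreted loop
print(timeit.timeit(lambda: sum(data), number=10))         # one C-level call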
The other thing is that calls in Python are expensive. There is an ABI for how calls should be done at the low level (i.e. push arguments onto the stack and jump), and C/C++ follows it. Python does a lot more than that: frame objects are created (which can be inspected, e.g. when an exception happens), there's a max-recursion check, etc. All of that counts toward the performance loss.
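The call overhead is easy to measure on its own; here's a minimal sketch (the exact numbers will vary, but the function-call version is consistently slower):

import timeit

print(timeit.timeit('x + 1', setup='x = 0'))
print(timeit.timeit('f(x)', setup='x = 0\ndef f(v): return v + 1'))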
So the palindrome function makes a lot of calls, and recursion is inefficient in Python. In particular, this is the reason why palindrome2 is faster than palindrome.
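This is also why the profiled script calls sys.setrecursionlimit: p recurses once per character pair, so the default limit of roughly 1000 is nowhere near enough for long inputs. A sketch of what that looks like, with a shorter string of my own choosing:

import sys

s = palindrome_maker(10**4)
# One recursive call per character pair. Note that raising the limit
# too far can still overflow the C stack and crash the interpreter.
sys.setrecursionlimit(len(s) // 2 + 100)
print(palindrome(s))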
The other thing is that palindrome1 has [::-1], which translates into a BUILD_SLICE instruction followed by a subscript, and the slicing itself is implemented in C. So even though it does more work than necessary (there is no reason to create another copy of the string), it is still faster than the other functions, simply because the intermediate layer (i.e. the bytecode) is minimal. There is no need for the interpreter to waste time interpreting bytecode character by character.
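You can see how cheap the C-level slice is next to the same reversal driven by interpreted bytecode. Again, a toy comparison of mine:

import timeit

st = 'ab' * 100000

def py_reverse(s):
    # Same result as s[::-1], but a batch of interpreted
    # instructions per character.
    out = []
    for i in range(len(s) - 1, -1, -1):
        out.append(s[i])
    return ''.join(out)

print(timeit.timeit(lambda: st[::-1], number=10))        # slice: runs in C
print(timeit.timeit(lambda: py_reverse(st), number=10))  # interpreted loop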
Another important thing is that each object you create in Python has to be garbage collected, and since these objects are generally bigger than plain C objects (for example due to the reference counter), this takes more time. Incrementing and decrementing reference counters costs time as well, by the way. There's also this thing called the GIL (Global Interpreter Lock), a lock that is acquired and released as bytecode executes so that the bytecode stays thread-safe, even though it is completely unnecessary for a single-threaded application. But Python doesn't know that you won't start threads at some point, so it has to do this every time. This is all so that you don't have to worry about hard problems that most C/C++ coders have to deal with. :)
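A quick way to see that per-object overhead, assuming a 64-bit CPython 3 (exact numbers vary by version and platform):

import sys

print(sys.getsizeof(1))    # typically 28 bytes; a C int is 4 or 8
print(sys.getrefcount(1))  # small ints are shared, so this count is large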
Now, PyPy is another story. It has this neat thing inside it called a JIT (Just-In-Time compiler). It takes Python bytecode and converts it into machine code on the fly, which is then reused. So the initial call to a function carries the compilation overhead, but everything is still faster overall: ultimately there is no bytecode being interpreted and the functions run directly on the CPU. However, this doesn't mean that PyPy is as fast as a function written in C (e.g. [::-1]), simply because lots of optimizations are done at the C level that we don't know how to implement in PyPy or any other Python interpreter. This is due to the nature of the language: it is dynamic. Whether it is truly impossible is another story; it's not obvious at all, but at the moment we just don't know how to do it.
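You can watch the JIT warm up by timing repeated calls to the same function. A rough sketch, assuming the definitions above are in scope; under PyPy the first run is typically the slowest, while under CPython all runs cost about the same:

import time

def bench(func, arg, repeats=5):
    for i in range(repeats):
        start = time.time()
        func(arg)
        # On PyPy the early runs include JIT compilation time.
        print('run %d: %.3fs' % (i, time.time() - start))

bench(palindrome2, palindrome_maker(10**6))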
tl;dr: built-in functions (or, more generally, C code run from Python) are always at least as fast as equivalent pure Python code, and in most cases a lot faster.