I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 1000000))
On StackOverflow and other websites, I've seen most people argue that the best way to do this is:
%%timeit
x = x.astype(str)
This takes about 2 seconds.
When I use x = x.apply(str), it only takes 0.2 seconds.
Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?
I'm mainly interested in Python 3's behavior for this.
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str) # 42ms (1)
%timeit x.map(str) # 42ms (2)
%timeit x.astype(str) # 559ms (3)
%timeit [str(i) for i in x] # 566ms (4)
%timeit list(map(str, x)) # 536ms (5)
%timeit x.values.astype(str) # 25ms (6)
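The %timeit lines above need IPython. As a rough sketch of the same comparison in plain Python, the standard-library timeit module can be used instead. The names candidates and best_time are illustrative only, and the timings will differ by machine and by pandas/NumPy version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as the benchmark above: 100,000 small non-negative integers.
x = pd.Series(np.random.randint(0, 100, 100_000))

# Candidate conversions, labelled (1)-(6) to match the %timeit lines.
candidates = {
    "(1) x.apply(str)": lambda: x.apply(str),
    "(2) x.map(str)": lambda: x.map(str),
    "(3) x.astype(str)": lambda: x.astype(str),
    "(4) [str(i) for i in x]": lambda: [str(i) for i in x],
    "(5) list(map(str, x))": lambda: list(map(str, x)),
    "(6) x.values.astype(str)": lambda: x.values.astype(str),
}


def best_time(func, repeat=3, number=1):
    """Return the fastest of `repeat` runs, in seconds per call."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


results = {label: best_time(func) for label, func in candidates.items()}
for label, seconds in results.items():
    print(f"{label:<28} {seconds * 1000:8.1f} ms")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; it does not remove it.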
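Points worth noting:
(1) and (2) take about the same time.
(3), (4) and (5) take about the same time, with (5) marginally ahead [... lambda function is used].
(6) is the fastest.
Why is x.map / x.apply fast?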
This appears to be because it uses fast compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)
    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))
    return result
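For readers who don't know Cython, here is a pure-Python sketch of what that loop does: allocate an object array, then store str() of each element. The function name astype_str_py is hypothetical and not part of pandas, and the sketch assumes a one-dimensional input. It shows the logic only; run in the interpreter like this, it does not have the speed of the compiled version.

```python
import numpy as np


def astype_str_py(arr):
    """Pure-Python sketch of the Cython loop (hypothetical helper, 1-D input assumed)."""
    arr = np.asarray(arr)
    n = arr.size
    # Object array: each slot holds a reference to a Python str.
    result = np.empty(n, dtype=object)
    for i in range(n):
        result[i] = str(arr[i])
    return result


converted = astype_str_py(np.array([3, 14, 15]))
print(converted.tolist())  # ['3', '14', '15']
print(converted.dtype)     # object
```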
Why is x.astype(str) slow?
Pandas applies str to each item in the series, not using the above Cython.
Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).
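Whatever the speed difference, the element-wise approaches should produce the same strings. A minimal check, assuming a small random Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000))

via_astype = x.astype(str).tolist()
via_apply = x.apply(str).tolist()
via_listcomp = [str(i) for i in x]

# All three routes call str() on each item, so the values should match.
same_values = via_astype == via_apply == via_listcomp
all_python_str = all(isinstance(v, str) for v in via_astype)
print(same_values, all_python_str)
```

So the choice between them comes down to speed, not output.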
Why is x.values.astype(str) so fast?
Numpy does not apply a function on each element of the array. One description of this I found:
If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str.
There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
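One way to see the difference is to inspect what each call returns. In the sketch below (variable names are illustrative), the NumPy conversion yields a fixed-width unicode array rather than an array of Python objects; wrapping it back into a Series is one possible way to get the fast path while keeping the index. Treat it as an experiment, not a recommendation: it assumes the Series holds plain integers with no missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, 1000))

# NumPy converts the whole buffer at once into a fixed-width unicode dtype.
as_numpy = s.values.astype(str)
print(as_numpy.dtype.kind)  # 'U'

# Wrap the result back into a Series, reusing the original index.
wrapped = pd.Series(as_numpy, index=s.index)

# The values match what the element-wise pandas conversion gives.
matches = wrapped.tolist() == s.astype(str).tolist()
print(matches)
```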