python · string · performance · pandas · python-internals

Converting a series of ints to strings - Why is apply much faster than astype?


I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000000))

On StackOverflow and other websites, I've seen most people argue that the best way to do this is:

%%timeit
x = x.astype(str)

This takes about 2 seconds.

When I use x = x.apply(str), it only takes 0.2 seconds.

Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?

I'm mainly interested in Python 3's behavior here.


Solution

  • Performance

    It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).

    import pandas as pd, numpy as np
    
    ### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
    
    x = pd.Series(np.random.randint(0, 100, 100000))
    
    %timeit x.apply(str)          # 42ms   (1)
    %timeit x.map(str)            # 42ms   (2)
    %timeit x.astype(str)         # 559ms  (3)
    %timeit [str(i) for i in x]   # 566ms  (4)
    %timeit list(map(str, x))     # 536ms  (5)
    %timeit x.values.astype(str)  # 25ms   (6)
    

    Points worth noting:

    1. (5) is marginally quicker than (3) / (4), which we expect as more work is moved into C [assuming no lambda function is used].
    2. (6) is by far the fastest.
    3. (1) / (2) are similar.
    4. (3) / (4) are similar.
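
    For readers without IPython, a similar comparison can be sketched with the standard-library timeit module (a rough sketch; absolute numbers will differ by machine and by pandas/numpy version):

    ```python
    import timeit

    import numpy as np
    import pandas as pd

    x = pd.Series(np.random.randint(0, 100, 100000))

    candidates = {
        "x.apply(str)": lambda: x.apply(str),
        "x.astype(str)": lambda: x.astype(str),
        "x.values.astype(str)": lambda: x.values.astype(str),
    }

    for label, fn in candidates.items():
        # Average over a few runs; relative ordering is what matters here
        t = timeit.timeit(fn, number=10) / 10
        print(f"{label:22s} {t * 1000:7.1f} ms")
    ```

    All three produce the same string values, so any of them is a drop-in replacement for the others when performance is the only concern.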

    Why is x.map / x.apply fast?

    This appears to be because both use fast compiled Cython code:

    cpdef ndarray[object] astype_str(ndarray arr):
        cdef:
            Py_ssize_t i, n = arr.size
            ndarray[object] result = np.empty(n, dtype=object)
    
        for i in range(n):
            # we can use the unsafe version because we know `result` is mutable
            # since it was created from `np.empty`
            util.set_value_at_unsafe(result, i, str(arr[i]))
    
        return result
    
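    Stripped of the Cython typing and the unsafe setter, the loop above is roughly equivalent to the following plain-Python sketch (illustrative only, not the actual pandas implementation):

    ```python
    import numpy as np

    def astype_str_sketch(arr):
        # Allocate an object array up front, then fill it with
        # str() of each element -- the same shape of loop as the
        # Cython astype_str, but without the compiled speedup.
        n = arr.size
        result = np.empty(n, dtype=object)
        for i in range(n):
            result[i] = str(arr[i])
        return result

    astype_str_sketch(np.array([1, 2, 3]))
    # array(['1', '2', '3'], dtype=object)
    ```

    The speed of the real version comes from the typed Cython loop and the unsafe setter, not from a different algorithm.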

    Why is x.astype(str) slow?

    Pandas applies str to each item in the Series in Python, rather than using the compiled Cython routine above.

    Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).
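
    The equivalence is not just in speed but in result; a quick check (a sketch, using a small Series for illustration):

    ```python
    import numpy as np
    import pandas as pd

    x = pd.Series(np.random.randint(0, 100, 1000))

    # Both routes yield an object-dtype Series of Python str values
    via_astype = x.astype(str)
    via_comprehension = pd.Series([str(i) for i in x], index=x.index)

    print(via_astype.equals(via_comprehension))  # True
    ```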

    Why is x.values.astype(str) so fast?

    NumPy does not call a Python function on each element of the array; the conversion happens in bulk in compiled code. One description of this I found:

    If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str.

    There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
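
    The difference between the two conversions is visible in the resulting dtypes (a quick check; the exact unicode width, e.g. <U21, depends on the input integer dtype):

    ```python
    import numpy as np
    import pandas as pd

    x = pd.Series(np.random.randint(0, 100, 5))

    # pandas: converts element by element, yielding an object array
    # whose elements are ordinary Python str
    print(x.astype(str).values.dtype)   # object
    print(type(x.astype(str).iloc[0]))  # <class 'str'>

    # numpy: bulk conversion to a fixed-width unicode array
    print(x.values.astype(str).dtype)
    ```

    So x.values.astype(str) is fastest, but it returns fixed-width numpy strings rather than Python str objects, which may or may not suit the downstream tools.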