Tags: pandas, numpy, vectorization, numpy-ndarray, array-broadcasting

Custom Vectorization in numpy for string arrays


I am trying to apply a custom function to numpy string arrays in a vectorized way.

Example:

import numpy

test_array = numpy.char.array(["sample1-sample","sample2-sample"])

numpy.char.array(test_array.split('-'))[:,0]

Output:

chararray([b'sample1', b'sample2'], dtype='|S7')

But these are built-in functions. Is there another way to achieve vectorization with custom functions? For example, with the following function:

def custom(text):
    return text[0]

Solution

  • numpy doesn't implement fast string methods (as it does for numeric dtypes), so the np.char code is more for convenience than performance. In the sessions below, np is numpy imported with import numpy as np.

    In [124]: alist=["sample1-sample","sample2-sample"]
    In [125]: arr = np.array(alist)
    In [126]: carr = np.char.array(alist)
    

    A straightforward list comprehension versus your code:

    In [127]: [item.split('-')[0] for item in alist]
    Out[127]: ['sample1', 'sample2']
    In [128]: np.char.array(carr.split('-'))[:,0]
    Out[128]: chararray([b'sample1', b'sample2'], dtype='|S7')
    In [129]: timeit [item.split('-')[0] for item in alist]
    664 ns ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    In [130]: timeit np.char.array(carr.split('-'))[:,0]
    20.5 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    For the simple task of clipping the strings, there is a fast numpy way - using a shorter dtype:

    In [131]: [item[0] for item in alist]
    Out[131]: ['s', 's']
    In [132]: carr.astype('S1')
    Out[132]: chararray([b's', b's'], dtype='|S1')
    

    But assuming that's just an example, not your real world custom action, I suggest using lists.
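As a rough sketch of what "using lists" could look like for the custom function from the question, the snippet below applies custom with a plain list comprehension and, for comparison, with np.vectorize. The names vcustom, first_chars_list and first_chars_arr are made up for illustration. Note that np.vectorize is documented as a convenience wrapper around a Python-level loop, so it should not be expected to beat the list comprehension.

```python
import numpy as np

def custom(text):
    # The custom function from the question: first character of each string.
    return text[0]

alist = ["sample1-sample", "sample2-sample"]
arr = np.array(alist)

# Plain Python: call the function once per item.
first_chars_list = [custom(item) for item in alist]

# np.vectorize gives array-style calling syntax but still loops in Python.
# otypes=[object] keeps the results as ordinary Python strings.
vcustom = np.vectorize(custom, otypes=[object])
first_chars_arr = vcustom(arr)

print(first_chars_list)
print(first_chars_arr)
```

If the result has to be an array, wrapping the list in np.array(...) afterwards is usually enough.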

    The np.char documentation recommends using the np.char functions on ordinary arrays instead of np.char.array. The functionality is basically the same. But using the arr above:

    In [140]: timeit np.array(np.char.split(arr, '-').tolist())[:,0]
    13.8 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    np.char functions often produce string dtype arrays, but split creates an object dtype array of lists:

    In [141]: np.char.split(arr, '-')
    Out[141]: 
    array([list(['sample1', 'sample']), list(['sample2', 'sample'])],
          dtype=object)
    

    Object dtype arrays are essentially lists: they hold references to Python objects, so working with their elements happens at Python speed.
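To illustrate the point about object dtype arrays, here is a small sketch that takes the result of np.char.split and pulls out the first piece of each element in two ways: iterating over it like a list, and wrapping a Python function with np.frompyfunc. The names parts, heads and first are only for illustration, and the lambda is a stand-in for any custom function. Both approaches call Python code once per element.

```python
import numpy as np

arr = np.array(["sample1-sample", "sample2-sample"])

# split returns an object dtype array whose elements are Python lists.
parts = np.char.split(arr, '-')

# Iterating over the object array works just like iterating over a list.
heads = [p[0] for p in parts]

# np.frompyfunc wraps a Python function as a ufunc that returns object arrays.
take_first = np.frompyfunc(lambda p: p[0], 1, 1)
first = take_first(parts)

print(heads)
print(first)
```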

    In [145]: timeit [item[0] for item in np.char.split(arr, '-').tolist()]
    9.08 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Your code is relatively slow because it takes time to convert this array of lists into a new chararray.
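One possible way to sidestep the list-to-array conversion for this particular split-and-take-first task is np.char.partition, which returns a string array with an extra axis of length 3 (text before the separator, the separator, text after) rather than an object array of lists. This is a different technique from the ones timed above, offered only as a sketch; the names pieces and prefixes are illustrative, and whether it is faster on real data would need to be timed.

```python
import numpy as np

arr = np.array(["sample1-sample", "sample2-sample"])

# partition splits each element around the first '-' only.
# The result has shape (2, 3): [before, separator, after] per element.
pieces = np.char.partition(arr, '-')

# Column 0 holds the text before the separator.
prefixes = pieces[:, 0]

print(pieces.shape)
print(prefixes)
```

This only covers splitting at the first separator; a genuinely custom per-string function still ends up in a Python-level loop.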