Comparison and indexing series of arrays with length > 1


The title sounds more complicated than the problem really is. Given the data

data = [
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
]    

s = pd.Series(data)

I would like to retrieve the elements of s that are equal to np.array(['x']). The obvious way

c = np.array(['x'])
s[s==c]

does not work: the comparison raises a ValueError, complaining that "'Lengths must match to compare', (5,), (1,)". I also tried

s[s=='x']

which only works if every element of s contains exactly one element itself.

Is there a way to retrieve all elements of s that are equal to c, without converting the elements to strings?
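
The failing comparison can be reproduced with a minimal sketch like the one below. It assumes a reasonably recent pandas version; the name error_message is only introduced here for illustration. As far as I can tell, pandas treats c as an array-like to be compared with s position by position, which is why a Series of length 5 and an array of length 1 are rejected.

```python
import numpy as np
import pandas as pd

s = pd.Series([
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
])
c = np.array(['x'])

# pandas tries to line up s and c position by position,
# so the length mismatch (5 vs 1) is reported as an error.
try:
    s == c
    error_message = None  # hypothetical name, used only in this sketch
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

So the comparison never looks inside the individual arrays; each element has to be compared with c explicitly.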


Solution

  • If a loop is acceptable, I think this is a simpler way:

    out = s[s.apply(lambda x: x.tolist() == ['x'])]
    

    out:

    0    [x]
    dtype: object
    

    Example data for checking the speed:

    import pandas as pd
    import numpy as np
    
    data1 = [
        np.array(['x'], dtype='object'),
        np.array(['y'], dtype='object'),
        np.array(['z'], dtype='object'),
        np.array(['x', 'z', 'y'], dtype='object'),
        np.array(['y', 'x'], dtype='object'),
    ]  * 1000000
    s1 = pd.Series(data1)
    

    s1 has 5,000,000 rows.

    c = np.array(['x'], dtype='object')
    d = c.tolist()
    

    Speed check (%timeit is an IPython magic):

    >>> import timeit
    >>> %timeit s1[s1.apply(lambda x: x.tolist() == d)]
    
    1.38 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    >>> %timeit s1[[np.array_equal(a, c) for a in s1]]
    
    22.2 s ± 754 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    >>> from functools import partial
    >>> eq_c = partial(np.array_equal, c)
    >>> %timeit s1[map(eq_c, s1)]
    
    
    21.8 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
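
Outside IPython, where %timeit is not available, the same comparison can be sketched with the standard timeit module. This is only a rough sketch under assumptions: the helper names select_tolist and select_array_equal are made up for illustration, and the data is repeated 1000 times instead of 1000000 so it finishes quickly. Absolute timings will differ from the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

data = [
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
] * 1000  # assumption: much smaller than the benchmark above
s = pd.Series(data)

c = np.array(['x'], dtype='object')
d = c.tolist()


def select_tolist(series, target_list):
    """Keep the elements whose list form equals target_list."""
    mask = series.apply(lambda a: a.tolist() == target_list)
    return series[mask]


def select_array_equal(series, target):
    """Keep the elements that np.array_equal reports as equal to target."""
    mask = [np.array_equal(a, target) for a in series]
    return series[mask]


out_a = select_tolist(s, d)
out_b = select_array_equal(s, c)

# Both approaches should select the same rows.
same_rows = out_a.index.equals(out_b.index)

# Total time in seconds for 5 calls of each approach.
t_tolist = timeit.timeit(lambda: select_tolist(s, d), number=5)
t_equal = timeit.timeit(lambda: select_array_equal(s, c), number=5)

print(f"same rows: {same_rows}")
print(f"tolist comparison: {t_tolist:.3f} s for 5 runs")
print(f"np.array_equal:    {t_equal:.3f} s for 5 runs")
```

Checking that both selections agree before timing them guards against comparing the speed of two approaches that return different results.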