The title sounds more complicated than the problem really is. Given the data
import numpy as np
import pandas as pd

data = [
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
]
s = pd.Series(data)
I would like to retrieve the elements of s that are equal to np.array(['x']). The obvious way
c = np.array(['x'])
s[s==c]
does not work: the comparison raises a ValueError complaining that "'Lengths must match to compare', (5,), (1,)". I also tried
s[s=='x']
which only works if every element of s contains exactly one item itself.
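The length error can be reproduced in isolation. The sketch below rebuilds the Series from the question and catches the exception; the variable name error_message is only illustrative. As far as I can tell, pandas treats the array c as a sequence to be compared with s position by position, which is why a length-1 array cannot be matched against a length-5 Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
])
c = np.array(['x'])

# s has 5 rows and c has 1 entry, so the element-wise comparison
# cannot pair them up and pandas raises instead of broadcasting.
try:
    s == c
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```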
Is there a way to retrieve all elements of s that are equal to c, without converting the elements to strings?
If a loop is acceptable, I think this is a simpler way:
out = s[s.apply(lambda x: x.tolist() == ['x'])]
out:
0    [x]
dtype: object
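The same idea can be wrapped in a small helper so that the target does not have to be hard-coded. This is only a sketch: select_equal is a hypothetical name, not a pandas function, and it assumes every element of the Series is a NumPy array. Comparing the two lists checks both the length and the values, so ['x', 'z', 'y'] does not match ['x'].

```python
import numpy as np
import pandas as pd


def select_equal(series, target):
    """Return the rows of `series` whose array equals `target` item for item.

    Hypothetical helper; assumes each element of `series` is a NumPy array.
    """
    # Convert the target once, outside the per-row function.
    target_list = np.asarray(target).tolist()
    mask = series.apply(lambda arr: arr.tolist() == target_list)
    return series[mask]


s = pd.Series([
    np.array(['x'], dtype='object'),
    np.array(['y'], dtype='object'),
    np.array(['z'], dtype='object'),
    np.array(['x', 'z', 'y'], dtype='object'),
    np.array(['y', 'x'], dtype='object'),
])

out_single = select_equal(s, np.array(['x']))
out_pair = select_equal(s, ['y', 'x'])
print(out_single.index.tolist(), out_pair.index.tolist())
```

Since the comparison is order-sensitive, select_equal(s, ['x', 'y']) would not return the last row.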
Example for checking the speed:
import pandas as pd
import numpy as np
data1 = [
np.array(['x'], dtype='object'),
np.array(['y'], dtype='object'),
np.array(['z'], dtype='object'),
np.array(['x', 'z', 'y'], dtype='object'),
np.array(['y', 'x'], dtype='object'),
] * 1000000
s1 = pd.Series(data1)
s1 has 5,000,000 rows.
c = np.array(['x'], dtype='object')
d = c.tolist()
Speed check:
>>> import timeit
>>> %timeit s1[s1.apply(lambda x: x.tolist() == d)]
1.38 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit s1[[np.array_equal(a, c) for a in s1]]
22.2 s ± 754 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> from functools import partial
>>> eq_c = partial(np.array_equal, c)
>>> %timeit s1[map(eq_c, s1)]
21.8 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)