Given a NumPy array of threshold values, what is the most efficient way to produce an array of counts of the elements of another array that meet these thresholds?
Assume the threshold value array is small and sorted, and the array of values to be counted is large-ish and unsorted.
Example: for each element of valueLevels, count the elements of values greater than or equal to it:
import numpy as np
n = int(1e5) # size of example
# example levels: the sequence 0, 1., 2.5, 5., 7.5, 10, 25, ... 50000, 75000
valueLevels = np.concatenate(
[np.array([0.]),
np.concatenate([ [ x*10**y for x in [1., 2.5, 5., 7.5] ]
for y in range(5) ] )
]
)
np.random.seed(123)
values = np.random.uniform(low=0, high=1e5, size=n)
So far I have tried the list comprehension approach.

np.array([sum(values>=x) for x in valueLevels])

was unacceptably slow.

np.array([len(values[values>=x]) for x in valueLevels])

was an improvement. Sorting values first did speed up the comprehension (in the example, from ~7 ms to 0.5 ms), but the cost of the sort (~8 ms) exceeded the savings for one-time use.

The best I have right now is a comprehension of this approach:
%%timeit
np.array([np.count_nonzero(values>=x) for x in valueLevels])
# 1000 loops, best of 3: 1.26 ms per loop
which is acceptable for my purposes, but out of curiosity, what I would like to know is whether there is a faster, fully vectorized way, e.g. using np.broadcast_arrays().

Answer:

The fastest I have so far is
%timeit np.count_nonzero(values >= np.atleast_2d(valueLevels).T, axis=1)
# 1000 loops, best of 3: 860 µs per loop
np.sum is slower:
%timeit np.sum(values >= np.atleast_2d(valueLevels).T, axis=1)
# 100 loops, best of 3: 2.5 ms per loop
@Divakar's version is even slower:
%timeit np.count_nonzero(values[:, None] >= valueLevels, axis=1)
# 100 loops, best of 3: 3.86 ms per loop
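For reference, the two broadcast orientations build intermediates that are transposes of each other and give identical counts; only the memory layout of the big boolean array differs, which is where the timing gap comes from. A minimal sketch (array names and sizes are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.uniform(0.0, 100.0, size=1000)   # values to count
levels = np.array([0.0, 10.0, 50.0, 90.0])  # sorted thresholds

# (len(levels), n) intermediate: one row per threshold, reduced along rows
row_counts = np.count_nonzero(vals >= levels[:, None], axis=1)

# (n, len(levels)) intermediate: one column per threshold, reduced along columns
col_counts = np.count_nonzero(vals[:, None] >= levels, axis=0)

# Both orientations yield the same counts
assert (row_counts == col_counts).all()
```

Note that for a 1-D array, levels[:, None] is equivalent to np.atleast_2d(levels).T.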
However, I would probably still use your list comprehension, which is not much slower and does not create a big 2D boolean array as an intermediate step:
%timeit np.array([np.count_nonzero(values>=x) for x in valueLevels])
# 1000 loops, best of 3: 987 µs per loop
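One alternative not benchmarked in this exchange: since valueLevels is sorted, you can also sort values once and read all the counts off with np.searchsorted. A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(123)
vals = rng.uniform(0.0, 1e5, size=100_000)
levels = np.array([0.0, 1.0, 2.5, 5.0, 7.5, 10.0])  # sorted thresholds

sorted_vals = np.sort(vals)
# searchsorted(..., side='left') gives, for each level, the number of
# elements strictly below it; the remaining elements are >= that level.
counts = sorted_vals.size - np.searchsorted(sorted_vals, levels, side='left')

# agrees with the boolean-mask comprehension
expected = np.array([np.count_nonzero(vals >= x) for x in levels])
assert (counts == expected).all()
```

Whether this beats the broadcast version depends on whether the cost of the sort can be amortized over repeated queries; as the question already notes, for one-time use the sort alone can exceed the savings.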