Search code examples
pythonnumpycumulative-frequency

Fast counts of elements of numpy array by value thresholds in another array


Given a numpy array of threshold values, what is the most efficient way to produce an array of the counts of another array meeting these values?

Assume the threshold value array is small and sorted, and the array of values to be counted is large-ish and unsorted.

Example: for each element of valueLevels, count the elements of values greater than or equal to it:

import numpy as np

n = int(1e5) # size of example

# example levels: the sequence 0, 1., 2.5, 5., 7.5, 10, 5, ... 50000, 75000
valueLevels =  np.concatenate(
                   [np.array([0.]), 
                    np.concatenate([ [ x*10**y for x in [1., 2.5, 5., 7.5] ] 
                                   for y in range(5) ] ) 
                    ]
                )

np.random.seed(123)
values = np.random.uniform(low=0, high=1e5, size=n)

So far I have tried the list comprehension approach.

  • np.array([sum(values>=x) for x in valueLevels])was unacceptably slow
  • np.array([len(values[values>=x]) for x in valueLevels]) was an improvement
  • sorting values did speed up the comprehension (in the example, from ~7 to 0.5 ms), but the cost of sort (~8 ms) exceeded the savings for one-time use

The best I have right now is a comprehension of this approach:

%%timeit 
np.array([np.count_nonzero(values>=x) for x in valueLevels])
# 1000 loops, best of 3: 1.26 ms per loop

which is acceptable for my purposes, but out of curiosity,

What I would like to know is

  • If list comprehension is the way to go, can it be sped up? Or,
  • Are other approaches faster? (I have a vague sense that this could be done by broadcasting the values array over the thresholds array, but I can't figure out how to get the dimensions right for np.broadcast_arrays().

Solution

  • The fastest I have so far is

    %timeit count_nonzero(values >= atleast_2d(valueLevels).T, axis=1)
    # 1000 loops, best of 3: 860 µs per loop
    

    sum is slower:

    %timeit sum(values >= atleast_2d(valueLevels).T, axis=1)
    # 100 loops, best of 3: 2.5 ms per loop
    

    @Divakar's version is even slower:

    %timeit count_nonzero(values[:, None] >= valueLevels, axis=1)
    # 100 loops, best of 3: 3.86 ms per loop
    

    However, I would probably still use your list comprehension, which is not much slower and does not create a big 2D boolean array as an intermediate step:

    %timeit np.array([np.count_nonzero(values>=x) for x in valueLevels])
    # 1000 loops, best of 3: 987 µs per loop