Tags: python, arrays, numpy, probability-density

efficient numpy.cumsum and numpy.digitize


Given a matrix of values that represent probabilities, I am trying to write an efficient process that returns the bin a given value belongs to. For example:

sample = 0.5
x = np.array([0.1]*10)
np.digitize( sample, np.cumsum(x))-1
#returns 4 (np.digitize itself gives 5 before the -1)

is the result I am looking for. According to timeit, for x arrays with few elements it is more efficient to do it as:

cdf = 0
for key,val in enumerate(x):
    cdf += val
    if sample<=cdf:
        print(key)
        break

while for bigger x arrays the numpy solution is faster. My questions:

  1. Is there a way to further accelerate it, e.g., a function that combines the steps?
  2. Can we vectorize the process for the case where sample is a list, each item of which is associated with its own x array (x will then be 2-D)?

In the application x contains the marginal probabilities; this is why I need to decrement the result of np.digitize.
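
A rough sketch of the kind of comparison meant here (array sizes and repetition counts are illustrative, and the crossover point will depend on the machine):

# rough benchmark sketch; sizes and repetition counts are arbitrary
import timeit

setup = """
import numpy as np
sample = 0.5
x = np.array([0.1]*{n})
"""

numpy_stmt = "np.digitize(sample, np.cumsum(x)) - 1"

loop_stmt = """
cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        break
"""

for n in (10, 100, 10000):
    t_np = timeit.timeit(numpy_stmt, setup.format(n=n), number=10000)
    t_loop = timeit.timeit(loop_stmt, setup.format(n=n), number=10000)
    print(n, t_np, t_loop)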


Solution

  • You could use some broadcasting magic there -

    (x.cumsum(1) > sample[:,None]).argmax(1)-1
    

    Steps involved :

    I. Perform cumsum along each row.

    II. Use a broadcasted comparison of each cumulative-sum row against its sample value and look for the first occurrence of the cumulative sum exceeding the sample; the element just before that position in x is the index we are looking for.

    Step-by-step run -

    In [64]: x
    Out[64]: 
    array([[ 0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ],
           [ 0.8 ,  0.96,  0.88,  0.36,  0.5 ,  0.68,  0.71],
           [ 0.37,  0.56,  0.5 ,  0.01,  0.77,  0.88,  0.36],
           [ 0.62,  0.08,  0.37,  0.93,  0.65,  0.4 ,  0.79]])
    
    In [65]: sample # one elem per row of x
    Out[65]: array([ 0.5,  2.2,  1.9,  2.2])
    
    In [78]: x.cumsum(1)
    Out[78]: 
    array([[ 0.1 ,  0.2 ,  0.3 ,  0.4 ,  0.5 ,  0.6 ,  0.7 ],
           [ 0.8 ,  1.76,  2.64,  2.99,  3.49,  4.18,  4.89],
           [ 0.37,  0.93,  1.43,  1.45,  2.22,  3.1 ,  3.47],
           [ 0.62,  0.69,  1.06,  1.99,  2.64,  3.04,  3.83]])
    
    In [79]: x.cumsum(1) > sample[:,None]
    Out[79]: 
    array([[False, False, False, False, False,  True,  True],
           [False, False,  True,  True,  True,  True,  True],
           [False, False, False, False,  True,  True,  True],
           [False, False, False, False,  True,  True,  True]], dtype=bool)
    
    In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
    Out[80]: array([4, 1, 3, 3])
    
    # A loopy solution to verify results against
    In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
    Out[81]: [4, 1, 3, 3]
    

    Boundary cases :

    The proposed solution automatically handles the cases where a sample value is less than the smallest of the cumulative summed values -

    In [113]: sample[0] = 0.08  # editing first sample to be lesser than 0.1
    
    In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
    Out[114]: [-1, 1, 3, 3]
    
    In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
    Out[115]: array([-1,  1,  3,  3])
    

    For cases where a sample value is greater than the largest of the cumulative summed values, we need one extra step -

    In [116]: sample[0] = 0.8  # editing first sample to be greater than 0.7
    
    In [121]: mask = (x.cumsum(1) > sample[:,None])
    
    In [122]: idx = mask.argmax(1)-1
    
    In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
    Out[123]: array([6, 1, 3, 3])
    
    In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
    Out[124]: [6, 1, 3, 3]