Search code examples
pythonmultidimensional-arrayscipybinning

scipy.stats.binned_statistic_dd() bin numbering has lots of extra bins


I'm struggling to deal with a scipy.stats.binned_statistic_dd() result. I have an array of positions and another array of ids that I'm binning in 3 directions. I'm providing a list of the bin edges as input rather than a number of bins in each direction coupled with a range option. I have 3 bins in x, 2 in y, and 3 in z, or 18 bins.

However, when I check the binnumbers listed, they are all in a range greater than 20. How do I get the bin numbers to reflect the number of bins provided and get rid of all the extra bins?

I've tried to follow what was suggested in this post (Output in scipy.stats.binned_statistic_dd()) which deals with something similar, but I can't understand how to apply this to my case. As usual, the documentation is as cryptic as ever.

Any help on get my binnumbers between 1-18 in this example would be greatly appreciated!

pos = np.array([[-0.02042167, -0.0223282 ,  0.00123734],
       [-0.0420364 ,  0.01196078,  0.00694259],
       [-0.09625651, -0.00311446,  0.06125461],
       [-0.07693234, -0.02749618,  0.03617278],
       [-0.07578646,  0.01199925,  0.02991888],
       [-0.03258293, -0.00371765,  0.04245596],
       [-0.06765955,  0.02798434,  0.07075846],
       [-0.02431445,  0.02774102,  0.06719837],
       [ 0.02798265, -0.01096739, -0.01658691],
       [-0.00584252,  0.02043389, -0.00827088],
       [ 0.00623063, -0.02642285,  0.03232817],
       [ 0.00884222,  0.01498996,  0.02912483],
       [ 0.07189474, -0.01541584,  0.01916607],
       [ 0.07239394,  0.0059483 ,  0.0740187 ],
       [-0.08519159, -0.02894125,  0.10923724],
       [-0.10803509,  0.01365444,  0.09555333],
       [-0.0442866 , -0.00845725,  0.10361843],
       [-0.04246779,  0.00396127,  0.1418258 ],
       [-0.08975861,  0.02999023,  0.12713186],
       [ 0.01772454, -0.0020405 ,  0.08824418]])

ids = np.array([16,  9,  6, 19,  1,  4, 10,  5, 18, 11,  2, 12, 13,  8,  3, 17, 14,
       15, 20,  7])

xbinEdges = np.array([-0.15298488, -0.05108961,  0.05080566,  0.15270093])
ybinEdges = np.array([-0.051,  0.   ,  0.051])
zbinEdges = np.array([-0.053,  0.049,  0.151,  0.253])

ret = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                statistic='count', expand_binnumbers=False)
bincounts = ret.statistic
binnumber = ret.binnumber.T

>>> binnumber  = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
       52, 32, 47], dtype=int64)

ranges = [[-0.15298488071, 0.15270092971],
 [-0.051000000000000004, 0.051000000000000004],
 [-0.0530000000000001, 0.25300000000000006]]

ret3 = stats.binned_statistic_dd(pos, ids, bins=(3,2,3), statistic='count', expand_binnumbers=False, range=ranges)
bincounts = ret3.statistic
binnumber = ret3.binnumber.T

>>> binnumber  = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
       52, 32, 47], dtype=int64)


Solution

  • Ok, after several days of background thinking and a quick scour through the binned_statistic_dd() source code I think I've come to the correct answer and it's pretty simple.

    It seem binned_statistic_dd() adds an extra set of outlier bins in the binning phase and then removes these when returning the histogram results, but leaving the bin numbers untouched (I think this is in case you want to reuse the result for further stats outputs).

    So it seems that if you export the expanded binnumbers (expand_binnumbers=True) and then subtract 1 from each binnumber to re-adjust the bin indices you can calculate the "correct" bin ids.

    ret2 = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                    statistic='count', expand_binnumbers=True)
    bincounts2 = ret2.statistic
    binnumber2 = ret2.binnumber
    indxnum2 = binnumber2-1
    corrected_bin_ids = np.ravel_multi_index((indxnum2),(numX, numY, numZ))
    

    Quick and simple in the end!