Search code examples
pythonnumpyscipybinning

How to find bin edges of given bin number returned by scipy.stats.binned_statistic_dd()?


I have a Nx3 array mm. The function call

c,edg,idx = scipy.stats.binned_statistic_dd(mm,[], statistic='count',bins=(30,20,10),rg=((3,5),(2,8),(4,6)))

returns idx, which is a 1d array of ints that represents the bin in which each element of mm falls, and edg is a list of 3 arrays holding the bin edges

What I need is to find the bin edges of a given bin given it's binnumber in idx. For example, given idx=[24,153,...,72] I want to find the edges of say bin 153, i.e. where that bin falls in edg. Of course I can find the elements in bin 153 by mm[153], but not the edges.

I posted this Nx3 case just for clarity. In reality, I am looking for a solution to the NxD case.


Solution

  • It helps to first be familiar with np.unravel_index. It converts a "flat index" (i.e. binnumber!) to a tuple of coordinates. You can think of the flat index as the index into arr.ravel(), and the tuple of coordinates as the index into arr. For example, if in the diagram below we think of the numbers 0,1,2,3,4,5 as bin numbers:

       | 0 | 1 | 2 |
    ---+---+---+---|
     0 | 0 | 1 | 2 |
     1 | 3 | 4 | 5 |
       +---+---+---|
    

    then np.unravel_index(4, (2,3))

    In [65]: np.unravel_index(4, (2,3))
    Out[65]: (1, 1)
    

    equals (1,1) because the 4th bin number in an array of shape (2,3) has coordinate (1,1).

    Okay then. Next, we need to know that internally scipy.stats.binned_statistic_dd adds two edges to the given bin edges to handle outliers:

    bin_edges = [np.r_[-np.inf, edge, np.inf] for edge in bin_edges]
    

    So the edge coordinates corresponding to the bin numbers is given by

    edge_index = np.unravel_index(binnumber, [len(edge)-1 for edge in bin_edges])
    

    (We use len(edge)-1 because the shape of the array axis is one less than the number of edges.)


    For example:

    import itertools as IT
    import numpy as np
    import scipy.stats as stats
    
    sample = np.array(list(IT.product(np.arange(5)-0.5, 
                                      np.arange(5)*10-5, 
                                      np.arange(5)*100-50)))
    bins = [np.arange(4),
            np.arange(4)*10,
            np.arange(4)*100]
    
    statistic, bin_edges, binnumber = stats.binned_statistic_dd(
        sample=sample, values=sample, statistic='count', 
        bins=bins, 
        range=[(0,100)]*3)
    
    bin_edges = [np.r_[-np.inf, edge, np.inf] for edge in bin_edges]
    edge_index = np.unravel_index(binnumber, [len(edge)-1 for edge in bin_edges])
    
    
    for samp, idx in zip(sample, zip(*edge_index)):
        vert = [edge[i] for i, edge in zip(idx, bin_edges)]
        print('{} goes in bin with left-most corner: {}'.format(samp, vert))
    

    yields

    [ -0.5  -5.  -50. ] goes in bin with left-most corner: [-inf, -inf, -inf]
    [ -0.5  -5.   50. ] goes in bin with left-most corner: [-inf, -inf, 0.0]
    [  -0.5   -5.   150. ] goes in bin with left-most corner: [-inf, -inf, 100.0]
    [  -0.5   -5.   250. ] goes in bin with left-most corner: [-inf, -inf, 200.0]
    ...