python numpy median

Compute median for numpy histogram2d data


I have some data that I bin in x and y. I then normalize within each x bin so that all the values in that bin sum to 1, which gives me a normalized probability for each value of y at each x.

    nA, binsx, binsy = np.histogram2d(dataA,dataB,
                                      bins=[binsA,binsB],density=False)

    H = np.ma.masked_where(nA==0.0, nA)
    for i in range(len(H[0,:])):     # Column index i, over len of row 0
        colTot = np.sum(H[:,i])
        for j in range(len(H[:,0])): # Row index j, over len of column 0
            H[j,i] = H[j,i]/colTot

At this point H is normalized along its columns: each column sums to 1.
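For reference, the same column normalization can be written without explicit loops. This is only a minimal sketch on made-up numbers: the `counts` array is hypothetical and stands in for `nA`, and `H_norm` is an illustrative name. It relies on broadcasting to divide every column by its own total, and leaves empty columns as zeros instead of masking them.

```python
import numpy as np

# Hypothetical counts standing in for nA
# (rows: bins of dataA, columns: bins of dataB).
counts = np.array([[1.0, 0.0, 2.0],
                   [3.0, 0.0, 2.0],
                   [0.0, 0.0, 4.0]])

col_totals = counts.sum(axis=0)  # one total per column

# Divide each column by its total; columns with no counts stay at zero.
H_norm = np.divide(counts, col_totals,
                   out=np.zeros_like(counts),
                   where=col_totals > 0)
```

Every non-empty column of `H_norm` then sums to 1, which matches what the double loop produces for the unmasked entries.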

My question is: how can I efficiently compute the median value in each column? I believe I need to build a new array for each column (that is, for the set of values in one x bin) in which each y value is repeated as many times as the original count (nA) for that y bin. That seems convoluted. Is there an easier way?

Here's what I'm trying now:

    nA, binsx, binsy = np.histogram2d(dataA,dataB,
                                      bins=[binsA,binsB],density=False)
    for j in range(nA[0,:].size): # Loop over number of columns
        oneMass = np.array([])
        for i in range(nA[:,0].size): # Loop over rows in y...
            tmp = np.repeat(binsA[i],np.int32(nA[i,j]))
            if  tmp.size > 0:
                oneMass = np.concatenate((oneMass,tmp) )

        print('Median',np.median(oneMass))

Solution

  • If you've already normalized the columns, you could just do a linear interpolation to 0.5 over the cumulative distribution function, i.e. find the bin value at which each column's running sum reaches 0.5:

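    cumCols = np.cumsum(np.ma.filled(H, 0), axis=0)             # running sum down each column (axis 0)
    cumCols = np.vstack([np.zeros(cumCols.shape[1]), cumCols])  # CDF is 0 at the lowest edge, binsA[0]
    # np.interp(x, xp, fp): find the binsA edge value at which each column's CDF reaches 0.5
    medians = np.array([np.interp(0.5, cumCols[:, i], binsA) for i in range(cumCols.shape[1])])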
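To sanity-check the interpolation idea, here is a rough end-to-end sketch on synthetic data. Everything about the data is an assumption made for illustration: the distributions, the bin edges, and the names `hist_medians` and `true_medians` are not from the question. It builds each column's CDF directly from the raw counts, so no prior normalization or masking is needed, and compares the result with the medians of the unbinned samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): dataA is the quantity whose median
# is wanted, dataB is the variable that defines the columns.
dataB = rng.uniform(0.0, 3.0, size=30000)
dataA = rng.normal(loc=2.0 * dataB, scale=1.0)

binsA = np.linspace(-6.0, 12.0, 181)  # fine bin edges along the rows
binsB = np.linspace(0.0, 3.0, 4)      # three columns

nA, _, _ = np.histogram2d(dataA, dataB, bins=[binsA, binsB])

# CDF down each column, with a leading row of zeros so that
# row k of cdf corresponds to bin edge binsA[k].
cdf = np.cumsum(nA, axis=0) / nA.sum(axis=0)
cdf = np.vstack([np.zeros(cdf.shape[1]), cdf])

# Bin-edge value at which each column's CDF crosses 0.5.
hist_medians = np.array([np.interp(0.5, cdf[:, i], binsA)
                         for i in range(cdf.shape[1])])

# Medians of the unbinned samples in each column, for comparison.
col = np.digitize(dataB, binsB) - 1
true_medians = np.array([np.median(dataA[col == i])
                         for i in range(len(binsB) - 1)])
```

The two estimates should agree to within about one bin width along `binsA`, since the histogram only knows where the CDF sits at the bin edges. Finer bins tighten the agreement, and a column with no counts would need separate handling because its total is zero.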