Search code examples
pythonnumpyscientific-computing

How to compute average within given percentiles in Python?


I am doing some scientific computing and I couldn't find an elegant way of performing the following operation. Suppose I have a 2-dimensional numpy array D which stores measurements of a given quantity at several times along the day. Each row corresponds to a different measuring instrument and each column corresponds to a different moment in the day at which the measurement was done.

Consider a list of desired percentiles. For example:

quantiles = [0.25, 0.5, 0.75]

My goal is to compute the average measurement by percentile group, at each moment in the day. In other words, given a column of measurements, I would like to sort all the measurements from that column in groups respecting the quantiles above and then take averages within groups. Using the example, I would have 4 groups at each moment of the day: the measurements in the lower quartile, then the measurements between the 25th and 50th quartile, the ones between the 50th and the 75th and finally the ones in the last quartile. Therefore, if m is the number of moments in the day when measurements were taken and q is the number of elements in the quantiles variable, my desired output would be qxm numpy array.

Currently, I am doing this in the most inefficient and hard-coded way possible. Here we go:

quantiles = [0.25, 0.5, 0.75]
window = "30min"
moments = pd.date_range(start = "9:30", end = "16:00", freq = window).time
quantile_curves = np.zeros((len(quantiles)+1, len(moments)-1))
EmpQuantiles = np.quantile(D, quantiles, axis = 0)
for moment in range(len(moments)-1):
    quantile_curves[0, moment] = np.mean(D[:, moment][D[:,moment] < EmpQuantiles[0, moment]])
    quantile_curves[1, moment] = np.mean(D[:, moment][np.logical_and(D[:,moment] > EmpQuantiles[0, moment], D[:,moment] <EmpQuantiles[1, moment])])
    quantile_curves[2, moment] = np.mean(D[:, moment][np.logical_and(D[:,moment] > EmpQuantiles[1, moment], D[:,moment] <EmpQuantiles[2, moment])])
    quantile_curves[3, moment] = np.mean(D[:, moment][D[:,moment] > EmpQuantiles[2, moment]])

What's an elegant and simpler way of doing this? I couldn't find the answer here however there is a related (but not the same) question in R: ddply multiple quantiles by group

I intend to plot the evolution of the in-group average along the day. I show the plot I get below (I am satisfied with the plot and I get the result I want however I seek better way of computing the quantile_curves variable):

enter image description here

Thanks a lot in advance!


Solution

  • You can do it efficiently using masked_arrays:

    import numpy as np
    
    quantiles = [0.25, 0.5, 0.75]
    print('quantiles:\n', quantiles)
    
    moments = [f'moment {i}' for i in range(5)]
    print('nb of moments:\n', len(moments))
    nb_measurements = 10000
    D = np.random.rand(nb_measurements,len(moments))
    quantile_values = np.quantile(D,quantiles,axis=0)
    print('quantile_values (for each moment):\n', quantile_values)
    
    quantile_curves = np.zeros((len(quantiles)+1,len(moments)))
    quantile_curves[0, :] = np.mean(np.ma.masked_array(D, mask=D>quantile_values[[0],:]), axis=0)
    for q in range(len(quantiles)-1):
      quantile_curves[q+1, :] = np.mean(np.ma.masked_array(D, mask=np.logical_or(D<quantile_values[[q],:], D>quantile_values[[q+1],:])), axis=0)
    quantile_curves[len(quantiles), :] = np.mean(np.ma.masked_array(D, mask=D<quantile_values[[len(quantiles)-1],:]), axis=0)
    
    print('mean for each group and at each moment:')
    print(quantile_curves)
    

    Output:

    % python3 script.py
    quantiles:
     [0.25, 0.5, 0.75]
    nb of moments:
     5
    quantile_values (for each moment):
     [[0.25271343 0.25434056 0.24658732 0.24612319 0.25221014]
     [0.51114344 0.50103699 0.49671249 0.49113293 0.49819521]
     [0.75629377 0.75427293 0.74676209 0.74211813 0.7490436 ]]
    mean for each group and at each moment
    [[0.12650993 0.12823392 0.12492136 0.12200609 0.12655318]
     [0.3826476  0.373516   0.37050513 0.36974876 0.37722219]
     [0.63454102 0.63023986 0.62280545 0.61696283 0.6238492 ]
     [0.87866019 0.87614489 0.87492553 0.87253142 0.87403426]]
    

    Note that I'm using random values between 0 and 1 that's why the quantile values (extremities of groups intervals) are almost equql to quantiles. Also not that this code works for an arbitrary number of quantiles or moments.