This is a short, complete example of a more complex real-world application.
Libraries used:
import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest
Data:
I have arrays of irregular bins defined by their start and end points, for example like these (in the real-world case this format is a given, as it is the output of another process):
bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
which I combine with:
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([ 0, 89, 93, 178, 184, 272, 277, 363, 368, 458])
giving a regularly alternating sequence of long and short intervals, all of irregular length.
This is a sketched representation of the given long and short intervals:
I have a bunch of time series data, similar to the random data created below:
# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);
Objective:
I would like to use the sequence of intervals to collect statistics (mean, percentiles, etcetera) on the data - but only using the long intervals, i.e. the yellow ones in the sketch.
Assumptions and clarifications:
The edges of long intervals never overlap; in other words, there is always a short interval in between long intervals. Also, the first interval is always a long one.
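These assumptions are easy to sanity-check directly from the two arrays; a minimal sketch:
# every long interval has positive length and ends before the next one starts
assert np.all(bin_starts < bin_ends)
assert np.all(bin_ends[:-1] < bin_starts[1:])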
Current solution:
One way to do it is to use scipy.stats.binned_statistic on all intervals and then slice the result to only keep every other one (i.e. [::2]), like this (a great help for some statistics, like np.percentile, was reading this SO answer by @ali_m):
ave = scist.binned_statistic(x, y,
                             statistic=np.nanmean,
                             bins=bns)[0][::2]
which gives me the desired result:
plt.plot(np.arange(0,5), ave);
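The same pattern works for other statistics by passing a callable; as a sketch, the 75th percentile (the particular percentile is just an arbitrary choice here):
# 75th percentile per interval, again keeping only the long intervals
p75 = scist.binned_statistic(x, y,
                             statistic=lambda v: np.percentile(v, 75),
                             bins=bns)[0][::2]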
Question:
Is there a more Pythonic way of doing this (using any of Numpy, Scipy or Pandas)?
I think using some combo of IntervalIndex, pd.cut, groupby, and agg is a relatively straightforward and easy way to get what you want.
I'd first make the DataFrame (not sure if this is the best way to go from np arrays):
import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y
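An equivalent one-liner straight from the numpy arrays, if you prefer:
df = pd.DataFrame({'x': x, 'y': y})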
Then you can define your bins as a list of tuples:
bins = list(zip(bin_starts, bin_ends))
Use a pandas IntervalIndex, which has a from_tuples() method, to create bins to later use in cut. This is useful because you then don't have to rely on slicing your bns array to disentangle the "regularly alternating sequence of long and short intervals"; instead you can explicitly define the bins you are interested in:
ii = pd.IntervalIndex.from_tuples(bins, closed='both')
The closed kwarg specifies whether to include the end member numbers in the interval. For example, for the tuple (0, 89), with closed='both' the interval will include both 0 and 89 (as opposed to left, right, or neither).
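A quick way to see what closed does, sketched with a single pd.Interval:
iv = pd.Interval(0, 89, closed='both')
0 in iv, 89 in iv
>>> (True, True)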
Then create a category column in the dataframe using pd.cut(), which is a method for binning values into intervals. An IntervalIndex object can be passed using the bins kwarg:
df['bin'] = pd.cut(df.x, bins=ii)
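Note that x values falling in the short gaps (e.g. 90-92) don't match any interval, so their bin is NaN and they are automatically left out of the groupby below:
# count of samples that fall outside the long intervals
df['bin'].isna().sum()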
Last, use df.groupby() and .agg() to get whatever stats you'd like:
df.groupby('bin')['y'].agg(['mean', np.std])
which outputs:
mean std
bin
[0, 89] -4.814449 3.915259
[93, 178] -7.019151 3.912347
[184, 272] 7.223992 5.957779
[277, 363] 15.060402 3.979746
[368, 458] -0.644127 3.361927
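Since the question also mentions percentiles, note that agg takes arbitrary callables as well; something along these lines should work (the 25th/75th percentiles are just an example):
df.groupby('bin')['y'].agg(p25=lambda s: np.percentile(s, 25),
                           p75=lambda s: np.percentile(s, 75))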