This is a short, complete example of a more complex real-world application.
Libraries used:
import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest
Data:
I have arrays of irregular bins defined by their start and end points, for example like these (in the real-world case this format is a given, as it is the output of another process):
bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
which I combine with:
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([ 0, 89, 93, 178, 184, 272, 277, 363, 368, 458])
giving a regularly alternating sequence of long and short intervals, all of irregular length.
This is a sketched representation of the given long and short intervals:
I have a bunch of time series data, similar to the random data created below:
# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);
Objective:
I would like to use the sequence of intervals to collect statistics (mean, percentiles, etcetera) on the data - but only using the long intervals, i.e. the yellow ones in the sketch.
Assumptions and clarifications:
The edges of long intervals never overlap; in other words, there is always a short interval in between long intervals. Also, the first interval is always a long one.
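These assumptions are easy to sanity-check directly from the two arrays; a minimal sketch:
# every long interval has positive length and ends before the next one starts
assert np.all(bin_starts < bin_ends)
assert np.all(bin_ends[:-1] < bin_starts[1:])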
Current solution:
One way to do it is to use scipy.stats.binned_statistic on all intervals and then slice the result to only keep every other one (i.e. [::2]), like this (a great help for some statistics, like np.percentile, was reading this SO answer by @ali_m):
ave = scist.binned_statistic(x, y,
                             statistic=np.nanmean,
                             bins=bns)[0][::2]
which gives me the desired result:
plt.plot(np.arange(0,5), ave);
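The same pattern works for other statistics by passing a callable; as a sketch, the 75th percentile (the particular percentile is just an arbitrary choice here):
# 75th percentile per interval, again keeping only the long intervals
p75 = scist.binned_statistic(x, y,
                             statistic=lambda v: np.percentile(v, 75),
                             bins=bns)[0][::2]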
Question:
Is there a more Pythonic way of doing this (using any of Numpy, Scipy or Pandas)?
I think using some combo of IntervalIndex, pd.cut, groupby, and agg is a relatively straightforward and easy way to get what you want.
I'd first make the DataFrame (not sure if this is the best way to go from np arrays):
import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y
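An equivalent one-liner straight from the numpy arrays, if you prefer:
df = pd.DataFrame({'x': x, 'y': y})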
Then you can define your bins as a list of tuples:
bins = list(zip(bin_starts, bin_ends))
Use a pandas IntervalIndex, which has a from_tuples() method, to create bins to later use in cut. This is useful because you then don't have to rely on slicing your bns array to disentangle the "regularly alternating sequence of long and short intervals"; instead you can explicitly define the bins you are interested in:
ii = pd.IntervalIndex.from_tuples(bins, closed='both')
The closed kwarg specifies whether to include the end member numbers in the interval. For example, for the tuple (0, 89), with closed='both' the interval will include both 0 and 89 (as opposed to left, right, or neither).
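A quick way to see what closed does, sketched with a single pd.Interval:
iv = pd.Interval(0, 89, closed='both')
0 in iv, 89 in iv
>>> (True, True)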
Then create a category column in the dataframe using pd.cut(), which is a method for binning values into intervals. An IntervalIndex object can be passed using the bins kwarg:
df['bin'] = pd.cut(df.x, bins=ii)
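Note that x values falling in the short gaps (e.g. 90-92) don't match any interval, so their bin is NaN and they are automatically left out of the groupby below:
# count of samples that fall outside the long intervals
df['bin'].isna().sum()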
Last, use df.groupby() and .agg() to get whatever stats you'd like:
df.groupby('bin')['y'].agg(['mean', np.std])
which outputs:
mean std
bin
[0, 89] -4.814449 3.915259
[93, 178] -7.019151 3.912347
[184, 272] 7.223992 5.957779
[277, 363] 15.060402 3.979746
[368, 458] -0.644127 3.361927
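Since the question also mentions percentiles, note that agg takes arbitrary callables as well; something along these lines should work (the 25th/75th percentiles are just an example):
df.groupby('bin')['y'].agg(p25=lambda s: np.percentile(s, 25),
                           p75=lambda s: np.percentile(s, 75))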