Search code examples
numpypandasgroup-bygroupingbinning

Double grouping data by bins AND time with pandas


I am trying to bin values from a timeseries (hourly and subhourly temperature values) within a time window. That is, from original hourly values, I'd like to extract binned values on a daily, weekly or monthly basis. I have tried to combine groupby+TimeGrouper(" ") with pd.cut, with poor results.

I have came across a nice function from this tutorial, which suggests to map the data (associating to each value with its mapped range on the next column) and then grouping according to that.

def map_bin(x, bins):
    kwargs = {}
    if x == max(bins):
        kwargs['right'] = True
    bin = bins[np.digitize([x], bins, **kwargs)[0]]
    bin_lower = bins[np.digitize([x], bins, **kwargs)[0]-1]
    return '[{0}-{1}]'.format(bin_lower, bin)

df['Binned'] = df['temp'].apply(map_bin, bins=freq_bins)

However, applying this function results in an IndexError: index n is out of bounds for axis 0 with size n.

Ideally, I'd like make this work and apply it to achieve a double grouping at the same time: one by bins and one by timegrouper.


Solution

  • Update:
    It appears that my earlier attempt was causing problems because of the double-indexed columns. I have simplified to something that seems to work much better.

    import pandas as pd
    import numpy as np
    xaxis = np.linspace(0,50)
    temps = pd.Series(data=xaxis,name='temps')
    times = pd.date_range(start='2015-07-15',periods=50,freq='6H')
    temps.index = times
    
    bins = [0,10,20,30,40,50]
    
    temps.resample('W').agg(lambda series:pd.value_counts(pd.cut(series,bins),sort=False)).unstack()
    

    This outputs:

               (0, 10]  (10, 20]    (20, 30]    (30, 40]    (40, 50]
    2015-07-19       9        10           0           0           0
    2015-07-26       0         0          10          10           8
    2015-08-02       0         0           0           0           2