
Customizing rolling_apply function in Python pandas


Setup

I have a DataFrame with three columns:

  • "Category" contains True and False, and I have done df.groupby('Category') to group by these values.
  • "Time" contains the timestamps (measured in seconds) at which values were recorded.
  • "Value" contains the values themselves.

At each time instance, two values are recorded: one has category "True", and the other has category "False".

Rolling apply question

Within each category group, I want to compute a number and store it in column Result for each time. Result is the percentage of values between time t-60 and t that fall between 1 and 3.
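To make the target concrete, here is the computation by hand on made-up numbers (four recordings for one category, treating "between 1 and 3" as the half-open interval [1, 3), as the solution below does):

```python
# Hypothetical recordings for a single category
times = [0, 20, 50, 70]   # seconds
values = [2, 5, 1, 9]

t = 70
# values recorded in the window (t-60, t] = (10, 70]
window = [v for u, v in zip(times, values) if t - 60 < u <= t]
# of those, the ones falling in [1, 3)
in_band = [v for v in window if 1 <= v < 3]
result = len(in_band) / len(window)
print(window, result)  # [5, 1, 9] -> 1 of 3 values in band
```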

The easiest way to accomplish this is probably to calculate the total number of values in that time interval via rolling_count, then execute rolling_apply to count only the values from that interval that fall between 1 and 3.
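(A side note for readers on current pandas: the module-level pd.rolling_count and pd.rolling_apply helpers have since been removed; both operations are now spelled through the .rolling(...) accessor. A minimal sketch of the modern spelling, on a made-up Series:)

```python
import pandas as pd

s = pd.Series([2, 5, 1, 9, 2])  # hypothetical values

# old pd.rolling_count(s, 3)  ->  count of observations in each 3-row window
total = s.rolling(window=3, min_periods=1).count()

# old pd.rolling_apply(s, 3, f)  ->  custom function applied to each 3-row window
in_interval = s.rolling(window=3, min_periods=1).apply(
    lambda w: ((1 <= w) & (w < 3)).sum(), raw=True)

print((in_interval / total).tolist())
```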

Here is my code so far:

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.set_index('Time') # index by time so we can count with rolling windows
    grp['total'] = grp['Value'].rolling(window=60).count() # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds

What is the proper rolling_apply() call to find grp['in_interval']?


Solution

  • Let's work through an example:

    import pandas as pd
    import numpy as np
    np.random.seed(1)
    
    def setup(regular=True):
        N = 10
        a = np.arange(N)
        b = np.arange(N)
    
        if regular:
            timestamps = np.linspace(0, 120, N)
        else:
            timestamps = np.random.uniform(0, 120, N)
    
        df = pd.DataFrame({
            'Category': [True]*N + [False]*N,
            'Time': np.hstack((timestamps, timestamps)),
            'Value': np.hstack((a,b))
            })
        return df
    
    df = setup(regular=False)
    df.sort_values(['Category', 'Time'], inplace=True)
    

    So the DataFrame, df, looks like this:

    In [4]: df
    Out[4]: 
   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
2      True   0.013725      2
5      True  11.080631      5
4      True  17.610707      4
6      True  22.351225      6
3      True  36.279909      3
7      True  41.467287      7
8      True  47.612097      8
0      True  50.042641      0
9      True  64.658008      9
1      True  86.438939      1
    

    Now, copying @herrfz, let's define

    def between(a, b):
        def between_percentage(series):
            return float(len(series[(a <= series) & (series < b)])) / float(len(series))
        return between_percentage
    

    between(1,3) is a function which takes a Series as input and returns the fraction of its elements which lie in the half-open interval [1,3). For example,

    In [9]: series = pd.Series([1,2,3,4,5])
    
    In [10]: between(1,3)(series)
    Out[10]: 0.4
    

    Now we are going to take our DataFrame, df, and group by Category:

    df.groupby(['Category'])
    

    For each group in the groupby object, we will want to apply a function:

    df['Result'] = df.groupby(['Category'], group_keys=False).apply(toeach_category)
    

    The function, toeach_category, will take a (sub)DataFrame as input, and return a DataFrame as output. The entire result will be assigned to a new column of df called Result.

    Now what exactly must toeach_category do? If we write toeach_category like this:

    def toeach_category(subf):
        print(subf)
    

    then we see each subf is a DataFrame such as this one (when Category is False):

   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
    

    We want to take the Time column, and for each time, apply a function. That's done with applymap:

    def toeach_category(subf):
        result = subf[['Time']].applymap(percentage)
    

    The function percentage will take a time value as input, and return a value as output: the fraction of rows whose Values fall between 1 and 3. applymap is very strict: percentage cannot take any other arguments.
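To see that strictness in isolation (a toy frame with made-up numbers; applymap hands each cell, alone, to a one-argument function):

```python
import pandas as pd

frame = pd.DataFrame({'Time': [10.0, 25.0, 40.0]})  # hypothetical times

# each cell is passed by itself to the function -- no way to pass extra arguments
doubled = frame.applymap(lambda t: t * 2)
print(doubled['Time'].tolist())  # [20.0, 50.0, 80.0]
```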

    Given a time t, we can select the Values from subf whose times are in the half-open interval (t-60, t] using a boolean mask with .loc:

    subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']
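The same mask-then-select step on a toy frame (made-up rows; the boolean mask keeps only rows where it is True):

```python
import pandas as pd

subf = pd.DataFrame({'Time': [5.0, 30.0, 80.0],
                     'Value': [2, 7, 1]})  # hypothetical rows

t = 80
# rows whose Time lies in the half-open window (t-60, t] = (20, 80]
picked = subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value']
print(picked.tolist())  # [7, 1]
```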
    

    And so we can find the percentage of those Values between 1 and 3 by applying between(1,3):

    between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    

    Now remember that we want a function percentage which takes t as input and returns the above expression as output:

    def percentage(t):
        return between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    

    But notice that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

    So how do we get out of this jam? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name like subf is first looked for in the Local scope, then the Enclosing scope, then the Global scope, and lastly in the Builtin scope. When percentage(t) is called and Python encounters subf, it first looks in the Local scope. Since subf is not a local variable in percentage, Python looks in the Enclosing scope of the function toeach_category, and finds it there. Perfect. That is just what we need.
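The same lookup chain in miniature, with nothing pandas-specific (toy names):

```python
def toeach(subf):            # subf is local to toeach...
    def percentage(t):
        # ...so when Python fails to find subf locally here,
        # it falls back to the enclosing scope of toeach and finds it
        return t + subf
    return percentage(10)

print(toeach(5))  # 15
```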

    So now we have our function toeach_category:

    def toeach_category(subf):
        def percentage(t):
            return between(1, 3)(
                subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
        result = subf[['Time']].applymap(percentage)
        return result
    

    Putting it all together,

    import pandas as pd
    import numpy as np
    np.random.seed(1)
    
    
    def setup(regular=True):
        N = 10
        a = np.arange(N)
        b = np.arange(N)
    
        if regular:
            timestamps = np.linspace(0, 120, N)
        else:
            timestamps = np.random.uniform(0, 120, N)
    
        df = pd.DataFrame({
            'Category': [True] * N + [False] * N,
            'Time': np.hstack((timestamps, timestamps)),
            'Value': np.hstack((a, b))
        })
        return df
    
    
    def between(a, b):
        def between_percentage(series):
            return float(len(series[(a <= series) & (series < b)])) / float(len(series))
        return between_percentage
    
    
    def toeach_category(subf):
        def percentage(t):
            return between(1, 3)(
                subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
        result = subf[['Time']].applymap(percentage)
        return result
    
    
    df = setup(regular=False)
    df.sort_values(['Category', 'Time'], inplace=True)
    df['Result'] = df.groupby(['Category'], group_keys=False).apply(toeach_category)
    print(df)
    

    yields

       Category       Time  Value    Result
    12    False   0.013725      2  1.000000
    15    False  11.080631      5  0.500000
    14    False  17.610707      4  0.333333
    16    False  22.351225      6  0.250000
    13    False  36.279909      3  0.200000
    17    False  41.467287      7  0.166667
    18    False  47.612097      8  0.142857
    10    False  50.042641      0  0.125000
    19    False  64.658008      9  0.000000
    11    False  86.438939      1  0.166667
    2      True   0.013725      2  1.000000
    5      True  11.080631      5  0.500000
    4      True  17.610707      4  0.333333
    6      True  22.351225      6  0.250000
    3      True  36.279909      3  0.200000
    7      True  41.467287      7  0.166667
    8      True  47.612097      8  0.142857
    0      True  50.042641      0  0.125000
    9      True  64.658008      9  0.000000
    1      True  86.438939      1  0.166667