Search code examples
pandasapply

Why does rolling.apply report an error? How should I modify the code


I have a dataframe like this:

    time_node   rate_node
  0   30          1.67
  1   60          1.82
  2   90          2.13
  3   180         2.53
  4   270         2.68
  5   360         2.71

I want to use window to achieve such calculations:Starting from the second line, which is the line with index equal to 1, (time_node[index] * rate_node[index] - time_node[index - 1] * rate_node[index - 1])/(time[index] - time[index- 1]).

For instance,the second line should be calculated like this: (60 * 1.82 - 30 * 1.67)/(60 - 30); the third row: (90 * 2.13 - 60 * 1.82)/(90 - 60),and add a column of calculation results.

My code is as follows:

nodes = [30, 60, 90, 180, 270, 360]
rate = [1.67000, 1.82000, 2.13000, 2.53000, 2.68000, 2.71000]

df = pd.DataFrame({'time_node': nodes, 'rate_node': rate})

def forward_rate(rate, time):
    denominator = rate[1] * time[1] - rate[0] * time[0]
    numerator = time[1] - time[0]
    return denominator/numertor
df.rolling(2).apply(forward_rate, {'rate': 'rate_node', 'time': 
'time_node'})

I want to use rolling.apply to implement calculations,but it's raise an ERROR,I've been reading the documentation for a long time, but I still don't know where the code is wrong. Please help me.


Solution

  • Solution in the context of the problem

    I guess you're looking for something like this:

    df['...'] = (df['rate_node']*df['time_node']).diff() / df['time_node'].diff()
    

    with the following output on test data:

    0     NaN
    1    1.97
    2    2.75
    3    2.93
    4    2.98
    5    2.80
    dtype: float64
    

    Note, that rolling windows will loop over pairs along each column separately when aggregating data with the apply method. The second positional variable {'rate': 'rate_node', 'time': 'time_node'} will be interpreted as the raw parameter, which should be boolean. That's why you're probably seeing ValueError: raw parameter must be True or False.

    See the description of Rolling.apply

    General case

    One of the solutions in the general case can be producing data with an outer for-loop. For example:

    name = 'new column'
    df[name] = float('nan')
    for index, window in zip(df.index, df.rolling(2)):
        if len(window) == 2:
            rate = window['rate_node'].values
            time = window['time_node'].values
            df.loc[index, name] = forward_rate(rate, time)
    

    where forward_rate is the original function. Note, that we need to pass a numpy ndarray as an arguments of forwar_rate in order to make it work, because otherwise the original data indexes of passed data will be confused with their local positional indexing.

    As alternative, we can try sliding windows from numpy:

    from numpy.lib.stride_tricks import sliding_window_view
    
    data = sliding_window_view(df[['rate_node', 'time_node']], window_shape=2, axis=0)
    
    rate = data[:,0,:].T
    time = data[:,1,:].T
    
    df.loc[df.index[1:], 'new_column'] = forward_rate(rate, time)
    

    where forward_rate is the original function.

    p.s. Rolling window with Numba engine

    If Numba is installed, then we can use rolling with method='table' and apply with engine='numba' in order to perform the windowing operation over all columns at once. In this case, the aggregating function should work with 2D arrays and return the answer which can be broadcasted along the columns. At the end we get a new DataFrame with the same structure as the original one. So in case if the function returns a number, we have to restrict the answer to any column of the result. For example:

    df['something_new'] = (
        df[['rate_node','time_node']]     # get columns in the right order
        .rolling(2, method='table')       # pass method='table' to work with all columns at once
        .apply(lambda x: (x[1,1]*x[1,0]-x[0,1]*x[0,0])/(x[1,1]-x[0,1]),   
               raw=True, engine='numba')  # x is indexed as a numpy.ndarray when engine='numba'
    ).iloc[:, 0]                          # use only one of the returned columns