Search code examples
pythonpandasdataframerolling-computation

Pandas rolling apply using multiple columns


I am trying to use a pandas.DataFrame.rolling.apply() rolling function on multiple columns. Python version is 3.7, pandas is 1.0.2.

import pandas as pd

#function to calculate
def masscenter(x):
    print(x); # for debug purposes
    return 0;

#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df2['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

'stamp' is monotonic and unique, 'price' is double and contains no NaNs, 'nQty' is integer and also contains no NaNs.

So, I need to calculate rolling 'center of mass', i.e. sum(price*nQty)/sum(nQty).

What I tried so far:

df.apply(masscenter, axis = 1)

masscenter is be called 5 times with a single row and the output will be like

price     87.6
nQty     739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64

It is desired input to a masscenter, because I can easily access price and nQty using x[0], x[1]. However, I stuck with rolling.apply() Reading the docs DataFrame.rolling() and rolling.apply() I supposed that using 'axis' in rolling() and 'raw' in apply one achieves similiar behaviour. A naive approach

rol = df.rolling(window=2)
rol.apply(masscenter)

prints row by row (increasing number of rows up to window size)

stamp
1900-01-01 02:59:47.000282    87.60
1900-01-01 03:00:01.042391    87.51
dtype: float64

then

stamp
1900-01-01 02:59:47.000282    739.0
1900-01-01 03:00:01.042391     10.0
dtype: float64

So, columns is passed to masscenter separately (expected).

Sadly, in the docs there is barely any info about 'axis'. However the next variant was, obviously

rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)

Never calls masscenter and raises ValueError in rol.apply(..)

> Length of passed values is 1, index implies 5

I admit that I'm not sure about 'axis' parameter and how it works due to lack of documentation. It is the first part of the question: What is going on here? How to use 'axis' properly? What it is designed for?

Of course, there were answers previously, namely:

How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.

How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests to write my own roll function, but the culprit for me is the same as asked in comments: what if one needs to use offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea to reinvent the wheel from scratch. Also I'd like to use pandas for everything to prevent inconsistency between sets obtained from pandas and 'self-made roll'. There is another answer to that question, suggessting to populate dataframe separately and calculate whatever I need, but it will not work: the size of stored data will be enormous. The same idea presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments

Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').

Some of the answers were asked before pandas 1.0 came out, and given that docs could be much better, I hope it is possible to roll over multiple columns simultaneously now.

The second part of the question is: Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?


Solution

  • How about this:

    import pandas as pd
    
    def masscenter(ser: pd.Series, df: pd.DataFrame):
        df_roll = df.loc[ser.index]
        return your_actual_masscenter(df_roll)
    
    masscenter_output = df['price'].rolling(window=3).apply(masscenter, args=(df,))
    

    It uses the rolling logic to get subsets via an arbitrary column. The arbitrary column itself is not used, only the rolling index is used. This relies on the default of raw=False which provides the index values for those subsets. The applied function uses those index values to get multi-column slices from the original dataframe.