Search code examples
pythonpandaspandas-resample

How to speed up resample idxmax and idxmin column calculations in pandas?


I have 1 minute ohlcv data in a pandas dataframe. I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.

Input is like

datetime open high low close
2022-01-01 10:00:00 10 12 9 11
2022-01-01 10:01:00 11 14 9 12
2022-01-01 10:02:00 12 12 8 10
2022-01-01 10:03:00 10 15 9 11
2022-01-01 10:04:00 10 12 8 11
2022-01-01 10:05:00 11 11 9 10

Output is like

datetime open high low close high_first
2022-01-01 10:00:00 10 15 8 11 0
2022-01-01 10:05:00 11 11 9 10 1

First, I tried the simple way:

I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.

df.resample("15min").agg({"high": ["idxmax", "max"], "low": ["idxmin", "min"]})

But I got an error:

ValueError: attempt to get argmax of an empty sequence

Because my data is not continuous (it cuts off at holidays)

So, left to my own devices, I wrote an apply function:

def transform(x):
    x["open_first"] = x["open"][0]
    x["close_last"] = x["close"][-1]
    x["high_max"] = x["high"].max()
    x["low_min"] = x["low"].min()
    x["high_idxmax"] = x["high"].idxmax()
    x["low_idxmin"] = x["low"].idxmin()
    x["volume_sum"] = x["volume"].sum()
    x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
    return x

But it is very slow. Is it possible to make it faster and avoid df.apply ?


Solution

  • You can use DatetimeIndex.floor for 5 minutes values with not continuous values of DatetimeIndex:

    df1 = (df.groupby(df.index.floor('5min'))
            .agg(open_first=('open','first'),
                 close_last=('close','last'),
                 high_max=('high','max'),
                 low_min=('low','min'),
                 high_idxmax=('high','idxmax'),
                 low_idxmin=('low','idxmin'))
            .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
    print (df1)
                         open_first  close_last  high_max  low_min  \
    datetime                                                         
    2022-01-01 10:00:00          10          11        15        8   
    2022-01-01 10:05:00          11          10        11        9   
    
                                high_idxmax          low_idxmin  high_first  
    datetime                                                                 
    2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00       False  
    2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00       False  
    

    Your solution create consecutive index, for idxmax/idmin are necessary lambda functions:

    df1 = (df.groupby(pd.Grouper(freq='5Min'))
            .agg(open_first=('open','first'),
                 close_last=('close','last'),
                 high_max=('high','max'),
                 low_min=('low','min'),
                 high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
                 low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
            .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
    print (df1)