Search code examples
pythonpandaspandas-resampleohlc

How to resample ohlc data properly in pandas / custom fill method per column


I have got OHLC data with missing time frames. Suppose I have the following pandas dataframe denoted by the variable df:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

Now, I resample that pandas dataframe to fill the missing gap and I get the following:

df = df.resample('T').ffill()

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

From the above, we can see that the missing gap (00:01:00) is filled with the help of ffill(). However, the data in that row (row starting with 00:01:00) is not displayed properly as the opening price should be the same as the closing price of the previous row (row starting with 00:00:00). Likewise, the closing price of that row (row starting with 00:01:00) should be the same as the opening price of the next row (row starting with 00:02:00). The desired output should look like this:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

How would I resolve this problem in pandas?


Solution

  • Unfortunately, you can't directly specify a fill method per column.

    A workaround would be not to fill the values during the resampling but to do it afterwards:

    df = df.resample('T').fillna(None)
    
    df['Open'], df['Close'] = (df['Open'].fillna(df['Close'].ffill()),
                               df['Close'].fillna(df['Open'].bfill()))
    
    df = df.ffill()
    

    output:

                            Open     High      Low    Close
    2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
    2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
    2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
    

    previous answer (incorrect for OHLC but interesting as generalization)

    (df.resample('T')
       .fillna(None)
       .assign(Close=lambda d: d['Close'].bfill())  # bfill for Close
       .ffill()                                     # ffill for others
    )
    

    output:

                            Open     High      Low    Close
    2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
    2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67223
    2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
    
    cross filling the values:
    (df.resample('T')
       .fillna(None)
       .assign(Open=lambda d: d['Open'].fillna(d['Close'].ffill())) # Open = last Close
       .ffill()  # ffill the others
    )
    

    output:

                            Open     High      Low    Close
    2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
    2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67123
    2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
    
    more options

    Here is another example where we'll interpolate High and leave Low as NaNs:

    (df.resample('T')
       .fillna(None)
       .assign(Open=lambda d: d['Open'].ffill(),
               Close=lambda d: d['Close'].bfill(),
               High=lambda d: d['High'].interpolate()
              )
    )
    

    output:

                            Open     High      Low    Close
    2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
    2019-04-19 00:01:00  0.67068  0.67274      NaN  0.67223
    2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223