
Pandas dataframe group by 10 min intervals with different actions on other columns


I have a pandas dataframe which includes a timestamp and 71 other columns, something like this:

              timestamp           | close_price | highest_price | volume | ...
              2018-09-29 00:00:20 | 1809        | 1811          | ...    |
              2018-09-29 00:00:34 | 1823        | 1832          |        |
              2018-09-29 00:00:59 | 1832        | 1863          |        |
              2018-09-29 00:01:09 | 1800        | 1802          |        |
              2018-09-29 00:01:28 | 1832        | 1845          |        |
              ...

I want to bin the data into 10-minute intervals and apply a different operation to each column: for close_price, the 10-minute interval should show the last value in the corresponding range; for highest_price, the max value in that range; for volume, the mean of the values in that range. I already tried

dataTable = dataframe.resample("10min").agg({'first_price': 'first',
                                             'close_price': 'last',
                                             'highest_price': 'max',
                                             'volume': 'mean',
                                             # other attributes...
                                             })

but the result seems to be incorrect. Is there another way to do what I want? I would appreciate any comments or thoughts.

Note that there is no specific pattern in the timestamp values: a single minute can contain anywhere from 0 to 60 rows.


Solution

  • Your approach is correct: dataframe.resample("10min").agg() does exactly the calculations you describe. You might get more output rows than you expect, and that is because resample keeps stepping forward 10 minutes at a time and aggregates whatever falls into each bin; if a 10-minute interval contains no data at all, it still emits a row, filled with NULLs. If your data is not continuous, that is what produces those NULL rows.

    You can simply delete the NULL rows by using dataframe.dropna()
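As a minimal sketch of the whole flow, here is a tiny dataframe in the question's shape (column names from the question, values made up) with a deliberate gap between 00:10 and 00:20; resample emits an all-NaN row for the empty bucket, and dropna() removes it. Note that resample needs a DatetimeIndex, or the time column passed via on=:

```python
import pandas as pd

# Hypothetical sample data in the question's layout; values are made up.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2018-09-29 00:00:20",
                "2018-09-29 00:00:34",
                "2018-09-29 00:00:59",
                # no rows between 00:10 and 00:20 -> empty resample bucket
                "2018-09-29 00:21:09",
                "2018-09-29 00:21:28",
            ]
        ),
        "close_price": [1809, 1823, 1832, 1800, 1832],
        "highest_price": [1811, 1832, 1863, 1802, 1845],
        "volume": [10, 20, 30, 40, 50],
    }
)

# resample requires a DatetimeIndex; on="timestamp" uses that column instead.
agg = df.resample("10min", on="timestamp").agg(
    {"close_price": "last", "highest_price": "max", "volume": "mean"}
)
# The empty 00:10-00:20 bucket is an all-NaN row; drop it.
agg = agg.dropna()
```

After dropna(), only the 00:00 and 00:20 buckets remain: the 00:00 row holds the last close (1832), the interval max (1863), and the mean volume (20.0).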