Search code examples
pythonpandasdictionarydataframechunks

Pandas: chunking of a dictionary of dataframes based on time


I have a dictionary of dataframes where each dataframe has price and timestamp columns. something like this {'A':df1, 'B':df2}

I need to build a function which can slice up the dataframes inside the dict in chunks of H hours of the timestamp and then pass this dict of dataframes to another function(which does some computation) for every chunk.

How do I go forward with this?

For example

def foo(dict_of_dataframes):
    for id, df in dict_of_dataframes.items():
        do_something()

def boo(dict_of_dataframes, chunksize):
    """
    Needs to chunk up the @dict_of_dataframes in @chunksize hours
    and needs to call foo function on these chunks of
    @dicts_of_dataframes
    """

Sample data:

df1:
Time                       Price
2017-03-07 09:47:31+00:00  100
2017-03-07 11:27:31+00:00  120
2017-03-07 14:47:31+00:00  150
2017-03-07 17:17:31+00:00  135
2017-03-07 20:57:31+00:00  200
2017-03-08 03:27:31+00:00  120
2017-03-08 09:57:31+00:00  100
2017-03-08 11:27:31+00:00  150

df2:
Time                       Price
2017-03-07 09:07:31+00:00  200
2017-03-07 10:27:31+00:00  300
2017-03-07 12:47:31+00:00  100
2017-03-07 17:47:31+00:00  250
2017-03-07 22:27:31+00:00  300
2017-03-08 01:57:31+00:00  500
2017-03-08 02:57:31+00:00  500
2017-03-08 10:27:31+00:00  100

I need help with the boo function. How does one go forward with this?

Also is there any specific term for these kinds of boo functions which simulate other function calling. I've seen these a few times, If you could point to a resource which explains how to design these 'function caller' functions, I'd really appreciate that.


Solution

  • I think what you actually want can be achieved using resample - basically a groupby for datetimes. Assuming you need transaction sum within 6 hours, you can use this:

    def boo(dict_dfs, hours):
        return {k: v.resample(f'{hours}H').sum() for k,v in dict_dfs.items()}
    

    now, if you 100% sure you need dicts instead, use groupby:

    def boo(dict_dfs, hours):
        return {k:{hr:v for hr, v in df.groupby(Grouper(key='Time', freq=f'{hours}H'))} for k, df in dict_dfs.items()}
    

    Btw, if you want to loop through {key, value} on dicts, use dict.items(), not dict itself.

    And one more note: I saw many times people overcomplicating their data structures. Most of the time you don't need dict of dataframes - you can use one dataframe, just having a category column or even a multi-index (like, [category, Time] multi-index in your case. With that, you'll get more reusable, fast and clean code!