Search code examples
pandasgroupingresample

Pandas - group rows on date condition (part 2)


I recently asked a question regarding grouping pandas rows on a date condition, and got an excellent answer see here

However, it got me thinking that if this functionality was placed into a common function, then there is a scenario where this would not work.

So, using the same data from the previous question...

             date_time  day  value1  value2
0  2023-03-15 00:00:00    3       1       1
1  2023-03-15 06:00:00    3       2       2
2  2023-03-15 12:00:00    3       3       3
3  2023-03-15 18:00:00    3       4       4
4  2023-03-16 00:00:00    4       5       5
5  2023-03-16 06:00:00    4       6       6
6  2023-03-16 12:00:00    4       7       7
7  2023-03-16 18:00:00    4       8       8
8  2023-03-17 00:00:00    5       9       9
9  2023-03-17 06:00:00    5      10      10
10 2023-03-17 12:00:00    5      11      11
11 2023-03-17 18:00:00    5      12      12
12 2023-03-20 06:00:00    1      13      13

If I perform the same resampling, except without an offset...

agg_dict = {
    "date_time": "last",
    "day": "last",
    "value1": "first",
    "value2": "last",
}
rows = df.resample("24H", on="date_time").agg(agg_dict).dropna()

This gives me...

                     date_time  day  value1  value2
date_time                                          
2023-03-15 2023-03-15 18:00:00  3.0     1.0     4.0
2023-03-16 2023-03-16 18:00:00  4.0     5.0     8.0
2023-03-17 2023-03-17 18:00:00  5.0     9.0    12.0
2023-03-20 2023-03-20 06:00:00  1.0    13.0    13.0

...which is correct, as all the data is grouped into the relevant days.

But if I then run the following commands on it (which is required when there is an offset), then the data would no longer be correct...

island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
rows = rows.groupby(island).agg(agg_dict)

...the data from the last 2 rows (Friday/Monday) get grouped into a single row...

                 date_time  day  value1  value2
island                                         
0      2023-03-15 18:00:00  3.0     1.0     4.0
1      2023-03-16 18:00:00  4.0     5.0     8.0
2      2023-03-20 06:00:00  1.0     9.0    13.0

I could perform some if logic to either perform the last 2 commands or not, but that doesn't seem a very elegant way of doing it. It would be good if both scenario's returned the correct results from the same test(s).

In a nutshell, I am trying to group data into 24H blocks. Any data that crosses a day boundary should be joined to the adjacent days data to complete the 24H block (including weekends).

So basically, if a 24H block of data roles over into a new day, it was only a problem when it crossed a gap, i.e. weekend, which has now beened solved, but any data that does not cross a day boundary, i.e. the data starts and ends on a Friday, should not be joined with Mondays data.

So if the data crosses a gap, I need to use the gap-n-island solution, but if the data doesn't cross a gap, then the data should be left as-is.

        Friday                  Monday
|                      ||                      |
    XXXXXXXXXXXXXXXXXXXXXXXX - gap crossed - join data
XXXXXXXXXXXXXXXXXXXXXXXX - gap not crossed - do nothing

Could (and is there) an additional test that can be added into the gap-n-island solution code somehow to allow both scenario's to work?


Solution

  • I think I finally figure out you logic: after the resample, if the date_time column goes into the next day then group that row with the next day.

    You can tweak the island logic:

    # if a row's date_time is on or before next row's date_time, give
    # the row its own island. Else include it in next row's island.
    island = (
        rows["date_time"]
        .le(rows.index.to_series().shift(-1, fill_value=pd.Timestamp.max))
        .cumsum()
        .rename("island")
    )
    rows.groupby(island).agg(agg_dict)