I have a pandas data frame like below that contains date-time values in column B
.
import pandas as pd
data = {'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'PQR', 'PQR', 'PQR', 'CVB', 'CVB', 'CVB', 'CVB'],
'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26', '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36', '2022-02-17 15:05:11', '2022-02-18 18:06:12', '2022-02-19 09:05:46', '2022-02-19 13:02:16', '2022-02-19 18:05:26', '2022-02-19 22:05:26']}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df
| A | B |
+-------+----------------------+
| XYZ | 2022-02-16 14:00:31 |
| XYZ | 2022-02-16 16:11:26 |
| XYZ | 2022-02-16 17:31:26 |
| XYZ | 2022-02-16 22:47:46 |
| PQR | 2022-02-17 07:11:11 |
| PQR | 2022-02-17 10:43:36 |
| PQR | 2022-02-17 15:05:11 |
| PQR | 2022-02-18 18:06:12 |
| CVB | 2022-02-19 09:05:46 |
| CVB | 2022-02-19 13:02:16 |
| CVB | 2022-02-19 18:05:26 |
| CVB | 2022-02-19 22:05:26 |
+-------+----------------------+
I want to group the date-time values for 24hr interval period, such that my output should look like below.
Expected Output :
| A | B | Group |
+-------+----------------------+-----------
| XYZ | 2022-02-16 14:00:31 | 1 |
| XYZ | 2022-02-16 16:11:26 | 1 |
| XYZ | 2022-02-16 17:31:26 | 1 |
| XYZ | 2022-02-16 22:47:46 | 1 |
| PQR | 2022-02-17 07:11:11 | 1 |
| PQR | 2022-02-17 10:43:36 | 1 |
| PQR | 2022-02-17 15:05:11 | 2 |
| PQR | 2022-02-18 18:06:12 | 3 |
| CVB | 2022-02-19 09:05:46 | 3 |
| CVB | 2022-02-19 13:02:16 | 3 |
| CVB | 2022-02-19 18:05:26 | 3 |
| CVB | 2022-02-19 22:05:26 | 4 |
+-------+----------------------+----------+
Currently, I tried to group the date-time values in column B
for the 24hr interval time period using the below code, which resulted in an unsuccessful output that I'm not expecting. My 24hr period will first start from the initial datetime i.e 2022-02-16 14:00:31
and then the next 24hr period will start from 2022-02-17 15:05:11
and so on.
df1 = df.reset_index().set_index(df['B']).rename_axis(None)
df1.loc[df1.first('24h').index, "GROUP"] = 1
df1
Actual Output :
| A | B | Group |
+-------+----------------------+------------+
| XYZ | 2022-02-16 14:00:31 | 1 |
| XYZ | 2022-02-16 16:11:26 | 1 |
| XYZ | 2022-02-16 17:31:26 | 1 |
| XYZ | 2022-02-16 22:47:46 | 1 |
| PQR | 2022-02-17 07:11:11 | 1 |
| PQR | 2022-02-17 10:43:36 | 1 |
| PQR | 2022-02-17 15:05:11 | NaN |
| PQR | 2022-02-18 18:06:12 | NaN |
| CVB | 2022-02-19 09:05:46 | NaN |
| CVB | 2022-02-19 13:02:16 | NaN |
| CVB | 2022-02-19 18:05:26 | NaN |
| CVB | 2022-02-19 22:05:26 | NaN |
+-------+----------------------+------------+
Is there a way where I can group the date-time values for the 24hr time interval period such that I can get the expected output as shown? I want to have an efficient solution so that the code can run on 1 million time stamps efficiently.
As far as each step of this process depends on the result of the previous one, we can't use resample
and transform
. We have to iterate all the data sequentially mapping them in a prescribed manner:
step = pd.Timedelta('24H')
sentinel = df.loc[0, 'B'] + step
group_id = 1
for index, value in df['B'].items():
if value > sentinel:
sentinel = value + step
group_id += 1
df.loc[index, 'Group'] = group_id
We can also hide this process in a Series.map
with help of a generator, which may work somewhat faster:
def gen(start, step):
sentinel = start + step
group_id = 1
value = yield
while True:
if value > sentinel:
sentinel = value + step
group_id += 1
value = yield group_id
start, step = df.loc[0, 'B'], pd.Timedelta('24H')
marker = gen(start, step).send
marker(None) # pump/init/push the generator to the first yield
df['Group'] = df['B'].map(marker)
del marker
Note, that all this works if df['B']
is sorted beforehand (which seems logical to assume due to the described grouping algorithm).
Update (about generators)
A new generator gen(start, step)
has a methond .send(...)
used to pass inside of it some value. So we can switch from the generator itself to the function my_func = get(start, step).send
which will get some value and return next item from the original generator depending on this value (saving its internal state between calls).
The first call my_func(None)
is initialisation of the generator (pushing the code up to the first yield
). Next time, when we call my_func(val)
the variable val
will be passed inside the generator (it will be assigned to value
in the code above at the 4th line value = yield
), and the code of the generator will continue processing from the next line after yield
until it reaches next yielding, and so on.
As for the line del marker
, it is the manual deletion of a variable that refers to a generator object. Nothing special, just cleaning (not sure if it's vital though, we can delete this line).