I'm trying to measure the time elapsed since the start of an event. In this case, I want to know whether the volume of bitcoin traded per minute has exceeded a certain threshold, because volume is what moves the price. So I want to measure how long there has been significant volume and record that measurement in a new column.
Here is an example of a dataframe with the date as the index and the volume. I added a column that indicates when the volume has exceeded a certain threshold:
import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': [132, 109, 74, 57, 123, 21],
    'Volume_cat': ["big_volume", "big_volume", None, None, "big_volume", None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
df
My goal is to have a new column that displays the elapsed time (in seconds) since the last detection of the 'big_volume' event and resets at each new detection. Here is a line that could be added to the example code to show the expected result:
df['delta_big_vol'] = [60, 120, 180, 240, 60, 120]
df
I think I have to use the apply() method, but I have not found a lambda that works. In pseudo-code it would look like this:
from datetime import timedelta
df['delta_xl_vol'] = df.apply(if df["Volume"] > 100 : return(timedelta.total_seconds))
Thanks for your help.
For this process, we can't have null values in our "Volume_cat" column:
>>> df["Volume_cat"] = df["Volume_cat"].fillna("-") # This could be any string except "big_volume"
This next step will help us later: we remember whether our data starts with a "big_volume" row
and also store the index of the first "big_volume" row.
>>> idx_of_first_big_volume = df.loc[df["Volume_cat"] == "big_volume"].head(1).index[0]
>>> starts_with_big_volume = idx_of_first_big_volume == df.index[0]
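On the example frame from the question, these helpers come out as:
>>> idx_of_first_big_volume
Timestamp('2022-01-11 09:30:00')
>>> starts_with_big_volume
True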
Now, let's assign a group to each run of consecutive identical values in the "Volume_cat" column (consecutive "big_volume" rows are grouped together, and consecutive "-" rows too).
>>> df["Group"] = ((df.Volume_cat != df.Volume_cat.shift()).cumsum())
Then we rank the rows. The important part is to pair each "big_volume" group with the "-" group that follows it, so the rank runs from the earliest "big_volume" event up to the last row before the next "big_volume" event. Also notice how starts_with_big_volume
helps us align the groups properly: if the data starts with a "big_volume" group, we need to shift the group numbers by subtracting 1 before the integer division by 2:
>>> df["rank"] = df.groupby((df["Group"] - 1 * starts_with_big_volume)// 2)["Volume_cat"].rank("first", ascending=False)
Finally, we can multiply the "rank" column (minus 1) by 60 to get the number of seconds since the last "big_volume" observation; the rows are one minute apart, so each rank step equals 60 seconds. Since this adds several helper columns, you may want to work on a copy of your dataframe and then bring only the "delta_big_vol" column back into the original.
>>> df["delta_big_vol"] = 60 * (df["rank"] - 1)
We can also use idx_of_first_big_volume
to fill with None all of the observations before the first "big_volume" event:
>>> df.loc[df.index < idx_of_first_big_volume, "delta_big_vol"] = None
This should be the output you get:
>>> df
Volume Volume_cat Group rank delta_big_vol
Time
2022-01-11 09:30:00 132 big_volume 1 1.0 0.0
2022-01-11 09:31:00 109 big_volume 1 2.0 60.0
2022-01-11 09:32:00 74 - 2 3.0 120.0
2022-01-11 09:33:00 57 - 2 4.0 180.0
2022-01-11 09:34:00 123 big_volume 3 1.0 0.0
2022-01-11 09:35:00 21 - 4 2.0 60.0
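Putting it all together, here is a minimal end-to-end sketch of the same steps (assuming the example frame from the question, with Volume as integers; drop the "Group" and "rank" helper columns afterwards if you don't need them):

import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': [132, 109, 74, 57, 123, 21],
    'Volume_cat': ["big_volume", "big_volume", None, None, "big_volume", None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)

# 1. Replace nulls so consecutive runs can be detected
df["Volume_cat"] = df["Volume_cat"].fillna("-")

# 2. Remember where the first "big_volume" row is
idx_of_first_big_volume = df.loc[df["Volume_cat"] == "big_volume"].index[0]
starts_with_big_volume = idx_of_first_big_volume == df.index[0]

# 3. Number each run of identical Volume_cat values
df["Group"] = (df["Volume_cat"] != df["Volume_cat"].shift()).cumsum()

# 4. Pair each "big_volume" run with the "-" run that follows it,
#    then rank the rows inside each pair in order of appearance
df["rank"] = (
    df.groupby((df["Group"] - 1 * starts_with_big_volume) // 2)["Volume_cat"]
    .rank("first", ascending=False)
)

# 5. Rows are one minute apart, so each rank step is 60 seconds;
#    rows before the first event get no value
df["delta_big_vol"] = 60 * (df["rank"] - 1)
df.loc[df.index < idx_of_first_big_volume, "delta_big_vol"] = None

print(df[["Volume", "delta_big_vol"]])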