I'm trying to measure the time elapsed since the start of an event. In this case, I want to know whether the volume of bitcoin traded per minute has exceeded a certain threshold, because volume is what moves the price. So I want to measure how long there has been significant volume and record that measurement in a new column.
Here is an example of a dataframe with the date as the index and the volume. I added a column that indicates when the volume has exceeded a certain threshold:
import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': [132, 109, 74, 57, 123, 21],
    'Volume_cat': ["big_volume", "big_volume", None, None, "big_volume", None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
df
My goal is to have a new column that displays the elapsed time (in seconds) since the last detection of the 'big_volume' event and resets at each new detection. Here is a line that could be added to the example code to show the expected result:
df['delta_big_vol'] = [60, 120, 180, 240, 60, 120]
df
I think I have to use the apply() method, but I have not found a lambda that works. In pseudo-code it would look like this:
from datetime import timedelta
df['delta_xl_vol'] = df.apply(if df["Volume"] > 100 : return(timedelta.total_seconds))
Thanks for your help.
For this process, we can't have null values in our "Volume_cat" column:
>>> df["Volume_cat"] = df["Volume_cat"].fillna("-") # This could be any string except "big_volume"
This next step will help us later: we remember whether our data starts with a "big_volume" row
and also store the index of the first "big_volume" row.
>>> idx_of_first_big_volume = df.loc[df["Volume_cat"] == "big_volume"].head(1).index[0]
>>> starts_with_big_volume = idx_of_first_big_volume == df.index[0]
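On the example frame from the question, these helpers come out as:
>>> idx_of_first_big_volume
Timestamp('2022-01-11 09:30:00')
>>> starts_with_big_volume
True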
Now, let's assign a group to each run of consecutive identical values in the "Volume_cat" column (consecutive "big_volume" rows are grouped together, and consecutive "-" rows too).
>>> df["Group"] = ((df.Volume_cat != df.Volume_cat.shift()).cumsum())
Then we rank the rows. The important part is to pair each "big_volume" group with the "-" group that follows it, so the rank runs from the earliest "big_volume" event up to the last row before the next "big_volume" event. Also notice how starts_with_big_volume
helps us align the groups properly: if the data starts with a "big_volume" group, we need to shift the group numbers by subtracting 1 before the integer division by 2:
>>> df["rank"] = df.groupby((df["Group"] - 1 * starts_with_big_volume)// 2)["Volume_cat"].rank("first", ascending=False)
Finally, we can multiply the "rank" column (minus 1) by 60 to get the number of seconds since the last "big_volume" observation; the rows are one minute apart, so each rank step equals 60 seconds. Since this adds several helper columns, you may want to work on a copy of your dataframe and then bring only the "delta_big_vol" column back into the original.
>>> df["delta_big_vol"] = 60 * (df["rank"] - 1)
We can also use idx_of_first_big_volume
to fill with None all of the observations before the first "big_volume" event:
>>> df.loc[df.index < idx_of_first_big_volume, "delta_big_vol"] = None
This should be the output you get:
>>> df
Volume Volume_cat Group rank delta_big_vol
Time
2022-01-11 09:30:00 132 big_volume 1 1.0 0.0
2022-01-11 09:31:00 109 big_volume 1 2.0 60.0
2022-01-11 09:32:00 74 - 2 3.0 120.0
2022-01-11 09:33:00 57 - 2 4.0 180.0
2022-01-11 09:34:00 123 big_volume 3 1.0 0.0
2022-01-11 09:35:00 21 - 4 2.0 60.0
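Putting it all together, here is a minimal end-to-end sketch of the same steps (assuming the example frame from the question, with Volume as integers; drop the "Group" and "rank" helper columns afterwards if you don't need them):

import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': [132, 109, 74, 57, 123, 21],
    'Volume_cat': ["big_volume", "big_volume", None, None, "big_volume", None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)

# 1. Replace nulls so consecutive runs can be detected
df["Volume_cat"] = df["Volume_cat"].fillna("-")

# 2. Remember where the first "big_volume" row is
idx_of_first_big_volume = df.loc[df["Volume_cat"] == "big_volume"].index[0]
starts_with_big_volume = idx_of_first_big_volume == df.index[0]

# 3. Number each run of identical Volume_cat values
df["Group"] = (df["Volume_cat"] != df["Volume_cat"].shift()).cumsum()

# 4. Pair each "big_volume" run with the "-" run that follows it,
#    then rank the rows inside each pair in order of appearance
df["rank"] = (
    df.groupby((df["Group"] - 1 * starts_with_big_volume) // 2)["Volume_cat"]
    .rank("first", ascending=False)
)

# 5. Rows are one minute apart, so each rank step is 60 seconds;
#    rows before the first event get no value
df["delta_big_vol"] = 60 * (df["rank"] - 1)
df.loc[df.index < idx_of_first_big_volume, "delta_big_vol"] = None

print(df[["Volume", "delta_big_vol"]])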