I have a pandas DataFrame with start and finish DateTime values for multiple events. I want to build a tally of how many events span each minute, i.e. how many times a given datetime (floored to the nearest minute) falls between an event's start and finish. I use a dictionary to build the tally, convert it into a DataFrame, then strip the date component and sum the tallies by time of day.
The following code works, but with a larger DataFrame (thousands of rows rather than hundreds) the processing time becomes unusably slow. I'm usually a fan of vectorisation but can't seem to get it working for this problem.
Working code (but slow):
# Creating the dataframe
import pandas as pd
from datetime import date
data = {
    "Date": [date(2024,1,5), date(2024,1,6), date(2024,1,7)],
    "Start": [pd.Timestamp("2024-01-05 10:05"), pd.Timestamp("2024-01-06 09:05"), pd.Timestamp("2024-01-07 11:12")],
    "Finish": [pd.Timestamp("2024-01-05 10:35"), pd.Timestamp("2024-01-06 09:55"), pd.Timestamp("2024-01-07 11:58")],
}
df = pd.DataFrame(data)
# Creating the date range
dates = pd.date_range(start=df["Date"].min(), end=df["Date"].max(), freq="1min")
# Dictionary of dates to store the cumulative score
d_data = {}
for x in dates:
    d_data[x] = 0
This next block, which iterates row by row, is what I want to improve for speed:
# Iterating over each row of the df and through the stored dates
for index, row in df.iterrows():
    for d in dates:
        if (d >= row["Start"]) & (d <= row["Finish"]):
            d_data[d] += 1
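As an aside, the nested membership test above can be collapsed into a single broadcasted comparison. A minimal sketch on a tiny hypothetical frame (note the boolean mask is rows × minutes in size, so this suits moderate inputs):

```python
import pandas as pd

# Hypothetical small example mirroring the question's frame.
df = pd.DataFrame({
    "Start":  [pd.Timestamp("2024-01-05 10:05")],
    "Finish": [pd.Timestamp("2024-01-05 10:07")],
})
dates = pd.date_range("2024-01-05 10:04", "2024-01-05 10:08", freq="1min")

# Broadcast: compare every minute against every event interval at once.
d = dates.values[:, None]                                    # shape (n_minutes, 1)
mask = (d >= df["Start"].values) & (d <= df["Finish"].values)
counts = mask.sum(axis=1)                                    # tally per minute
```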
Then I use the dictionary to create a DataFrame:
df_data = pd.DataFrame(index=d_data.keys(),data=d_data.values(),columns=["Count"])
df_data.reset_index(names="Date",inplace=True)
# Removing the date component to leave just the time, then summing each occurrence
df_data["Date"] = df_data["Date"].dt.time
df_data["Date"] = df_data["Date"].astype("str")
df_data["Date"] = df_data["Date"].str[:-3]
df_data = df_data.groupby("Date").sum().reset_index()
Output:
| Date | Count
0 | 00:00 | 0
1 | 00:01 | 0
545| 09:05 | 1
546| 09:06 | 1
# count the number of occurrences for each minute
date_ranges_list = []
for i, row in df.iterrows():
    date_range = pd.date_range(start=row["Start"], end=row["Finish"], freq="1min")
    date_ranges_list.append(pd.Series(date_range))
all_date_ranges = pd.concat(date_ranges_list)
minute_counts = all_date_ranges.value_counts().sort_index()
# if you want, you can add rows with zero counts
all_dates = pd.Series(
    0, index=pd.date_range(start=df["Date"].min(), end=df["Date"].max(), freq="1min")
)
minute_counts = all_dates.add(minute_counts, fill_value=0)
# convert to dataframe
out = minute_counts.to_frame(name="Count")
Result:
Count
2024-01-05 00:00:00 0.0
2024-01-05 00:01:00 0.0
2024-01-05 00:02:00 0.0
2024-01-05 00:03:00 0.0
2024-01-05 00:04:00 0.0
... ...
2024-01-07 11:54:00 1.0
2024-01-07 11:55:00 1.0
2024-01-07 11:56:00 1.0
2024-01-07 11:57:00 1.0
2024-01-07 11:58:00 1.0
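To finish the job the question asks for, the datetime index can then be collapsed to time of day and summed across days. A minimal sketch, assuming a `minute_counts` Series indexed by Timestamp like the one built above (here shortened to three hypothetical entries):

```python
import pandas as pd

# Hypothetical per-minute counts across three days.
minute_counts = pd.Series(
    [1, 2, 1],
    index=pd.to_datetime(["2024-01-05 09:05", "2024-01-06 09:05", "2024-01-07 11:12"]),
)

# strftime keeps only HH:MM; grouping on it sums the same minute across days.
per_minute_of_day = minute_counts.groupby(minute_counts.index.strftime("%H:%M")).sum()
```

This replaces the question's three-step time extraction (`.dt.time`, `astype("str")`, `.str[:-3]`) with a single `strftime` call.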