Tags: python, pandas, vectorization

A faster way to build a cumulative tally of DateTime values using Pandas?


I have a pandas DataFrame with Start and Finish DateTime values for multiple events. I want to build a tally of every time a certain datetime (down to the nearest minute) falls between an event's start and finish times. I use a dictionary to build the tally, turn it into a DataFrame, strip the date component, and sum the tallies by time of day.

I have the following code, which works, but with a larger DataFrame (thousands of rows rather than hundreds) the processing time becomes too slow to be useful. I'm usually a fan of vectorisation but can't seem to get it working for this problem.

Working code (but slow):

# Creating the dataframe
import pandas as pd
from datetime import date

data = {
    "Date": [date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7)],
    "Start": [pd.Timestamp("2024-01-05 10:05"), pd.Timestamp("2024-01-06 09:05"), pd.Timestamp("2024-01-07 11:12")],
    "Finish": [pd.Timestamp("2024-01-05 10:35"), pd.Timestamp("2024-01-06 09:55"), pd.Timestamp("2024-01-07 11:58")],
}

df = pd.DataFrame(data)

# Creating the date range
dates = pd.date_range(start=df["Date"].min(), end=df["Date"].max(), freq="1min")

# Dictionary of dates to store the cumulative tally
d_data = {x: 0 for x in dates}

This next part, which iterates row by row, is what I want to improve to gain speed.

# Iterating over each row of the df and through the stored dates
for index, row in df.iterrows():
    for d in dates:
        if row["Start"] <= d <= row["Finish"]:
            d_data[d] += 1

Then the dictionary is used to create a DataFrame:

df_data = pd.DataFrame(index=d_data.keys(),data=d_data.values(),columns=["Count"])
df_data.reset_index(names="Date",inplace=True)

# Removing the date component to leave just the time, then summing each occurrence
df_data["Date"] = df_data["Date"].dt.time
df_data["Date"] = df_data["Date"].astype("str")
df_data["Date"] = df_data["Date"].str[:-3]

df_data = df_data.groupby("Date").sum().reset_index()
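As an aside, the three string steps that strip the date (`dt.time`, `astype`, slicing) can be collapsed into a single `dt.strftime("%H:%M")` call. A small sketch with a hypothetical minimal `df_data` standing in for the real one:

```python
import pandas as pd

# hypothetical minimal frame mimicking df_data before the date is stripped
df_data = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05 09:05", "2024-01-06 09:05", "2024-01-05 09:06"]),
    "Count": [1, 1, 1],
})

# format each timestamp as HH:MM directly instead of time -> str -> slice
df_data["Date"] = df_data["Date"].dt.strftime("%H:%M")
out = df_data.groupby("Date", as_index=False)["Count"].sum()
```

This produces the same `Date`/`Count` shape as the original three-step version, with one vectorised string operation instead of three.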

Output:

       Date  Count
0     00:00      0
1     00:01      0
...     ...    ...
545   09:05      1
546   09:06      1

Solution

# Count the number of occurrences for each minute
date_ranges_list = []
for i, row in df.iterrows():
    date_range = pd.date_range(start=row["Start"], end=row["Finish"], freq="1min")
    date_ranges_list.append(pd.Series(date_range))
all_date_ranges = pd.concat(date_ranges_list)
minute_counts = all_date_ranges.value_counts().sort_index()

# If you want, you can add rows with zero counts
all_dates = pd.Series(
    0, index=pd.date_range(start=df["Date"].min(), end=df["Date"].max(), freq="1min")
)
minute_counts = all_dates.add(minute_counts, fill_value=0)

# Convert to a DataFrame
out = minute_counts.to_frame(name="Count")

Result:

                         Count
    2024-01-05 00:00:00    0.0
    2024-01-05 00:01:00    0.0
    2024-01-05 00:02:00    0.0
    2024-01-05 00:03:00    0.0
    2024-01-05 00:04:00    0.0
    ...                    ...
    2024-01-07 11:54:00    1.0
    2024-01-07 11:55:00    1.0
    2024-01-07 11:56:00    1.0
    2024-01-07 11:57:00    1.0
    2024-01-07 11:58:00    1.0
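To finish matching the question's desired output, the per-minute counts above still need to be collapsed to time of day. A self-contained sketch of the whole pipeline, using the same range-expansion idea and then grouping by the formatted time:

```python
import pandas as pd
from datetime import date

df = pd.DataFrame({
    "Date": [date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7)],
    "Start": pd.to_datetime(["2024-01-05 10:05", "2024-01-06 09:05", "2024-01-07 11:12"]),
    "Finish": pd.to_datetime(["2024-01-05 10:35", "2024-01-06 09:55", "2024-01-07 11:58"]),
})

# per-minute ranges for every event, concatenated and counted
ranges = [pd.Series(pd.date_range(r.Start, r.Finish, freq="1min")) for r in df.itertuples()]
minute_counts = pd.concat(ranges).value_counts().sort_index()

# union with the full span so minutes with zero events appear too
full_index = pd.date_range(df["Date"].min(), df["Date"].max(), freq="1min").union(minute_counts.index)
minute_counts = minute_counts.reindex(full_index, fill_value=0)

# collapse to time of day, as the question's final groupby does
out = (
    minute_counts.groupby(minute_counts.index.strftime("%H:%M"))
    .sum()
    .rename_axis("Date")
    .reset_index(name="Count")
)
```

The inner loop over every minute of the whole span is gone; only the event rows are iterated, and everything per-minute is handled by `date_range`, `value_counts`, and one groupby.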