Search code examples
pandasdataframenumpydatetimeindex

How to build "product" datetime index from a list of dates and a list of times in pandas?


I have

  • a list of dates (size=7713): [date(1994,5,25), ..., date(2023,12,19)]
  • a list of times (basically 5 minute interval of a day, size=288): [time(0, 0), ..., time(23, 55)]

I want to build a DatetimeIndex, where its full "product" combine, i.e., combine all time with each date. In the end, the size of the index is 7713 * 288.

My current solution is

datetimes = [pd.Timestamp.combine(d, t) for d in dates for t in times]
index = pd.DatetimeIndex(datetimes)

However it is super slow

  • combining takes around 9 seconds
  • building index takes around 9 seconds again

I tried directly give datetimes to pd.DataFrame(index=datetimes, ....), it takes 6 seconds (better than 9 seconds if I build index beforehand).

Do we have much faster way for this?


Solution

  • Your code is slow because it runs pd.Timestamp.combine in a double for loop.

    Instead, use pd.merge with type cross to build the "product" first. Then do either datetime.combine or a string summation as vector operations to create a joint set. Is either one of these two faster?

    from datetime import date, time, datetime
    import pandas as pd
    
    a_df = pd.DataFrame([date(1994,5,25), date(1999,6,14), date(2005,12,8)])
    b_df = pd.DataFrame([time(0, 0), time(16, 24), time(23, 55)])
    
    # step 1: build product with "merge"
    tmp = pd.merge(a_df, b_df, how='cross')
    
    # step 2a: collapse both cols with "datetime.combine"
    my_index_1 = tmp.apply(lambda r : datetime.combine(r['0_x'],r['0_y']),1)
    
    # step 2b: alternatively, collapse both cols via string conversion
    my_index_2 = pd.to_datetime(tmp['0_x'].astype(str) + ' ' + tmp['0_y'].astype(str))
    

    This gives you the desired result:

    print(my_index)
    0   1994-05-25 00:00:00
    1   1994-05-25 16:24:00
    2   1994-05-25 23:55:00
    3   1999-06-14 00:00:00
    4   1999-06-14 16:24:00
    5   1999-06-14 23:55:00
    6   2005-12-08 00:00:00
    7   2005-12-08 16:24:00
    8   2005-12-08 23:55:00
    dtype: datetime64[ns]
    

    Or, keep the date and time separate as a multiindex (depending on what analysis is coming down the road, maybe pivoting directly on time of day is good enough). Should spare you the effort/time needed to combine:

    tmp.set_index(['0_x', '0_y']).index
    MultiIndex([(1994-05-25, 00:00:00),
                (1994-05-25, 16:24:00),
                (1994-05-25, 23:55:00),
                (1999-06-14, 00:00:00),
                (1999-06-14, 16:24:00),
                (1999-06-14, 23:55:00),
                (2005-12-08, 00:00:00),
                (2005-12-08, 16:24:00),
                (2005-12-08, 23:55:00)],
               names=['0_x', '0_y'])