Search code examples
pythonpandasdate-range

Pandas CustomBusinessHour reduced performance


I need to reduce the performance of the following operation :

st = time.time()
bh = CustomBusinessHour(start='00:00', end='23:00')
bdates = pd.date_range(start='2024-01-01 00:00:00', end='2024-12-31 23:00:00', freq=bh, name='ts', closed=None)
print_statistics(f'{bdates}', 'filter_bp()', 'utils.py', time.time() - st)

which gives has the following output :

DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 01:00:00',
               '2024-01-01 02:00:00', '2024-01-01 03:00:00',
               '2024-01-01 04:00:00', '2024-01-01 05:00:00',
               '2024-01-01 06:00:00', '2024-01-01 07:00:00',
               '2024-01-01 08:00:00', '2024-01-01 09:00:00',
               ...
               '2024-12-31 13:00:00', '2024-12-31 14:00:00',
               '2024-12-31 15:00:00', '2024-12-31 16:00:00',
               '2024-12-31 17:00:00', '2024-12-31 18:00:00',
               '2024-12-31 19:00:00', '2024-12-31 20:00:00',
               '2024-12-31 21:00:00', '2024-12-31 22:00:00'],
              dtype='datetime64[ns]', name='ts', length=6026, freq='CBH')                                

Execution time: 0.40148735046386718750 s

If I try just a similar operation without CustomBusinessHour I have a much better execution time :

st = time.time()
test1 = pd.date_range(start='2024-01-01 00:00:00', end='2024-12-31 23:00:00', freq='H', name='ts', closed=None)
print_statistics(f'{test1}', 'filter_bp()', 'utils.py', time.time() - st)

which gives has the following output :

DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 01:00:00',
               '2024-01-01 02:00:00', '2024-01-01 03:00:00',
               '2024-01-01 04:00:00', '2024-01-01 05:00:00',
               '2024-01-01 06:00:00', '2024-01-01 07:00:00',
               '2024-01-01 08:00:00', '2024-01-01 09:00:00',
               ...
               '2024-12-31 14:00:00', '2024-12-31 15:00:00',
               '2024-12-31 16:00:00', '2024-12-31 17:00:00',
               '2024-12-31 18:00:00', '2024-12-31 19:00:00',
               '2024-12-31 20:00:00', '2024-12-31 21:00:00',
               '2024-12-31 22:00:00', '2024-12-31 23:00:00'],
              dtype='datetime64[ns]', name='ts', length=8784, freq='H')  

Execution time: 0.00157237052917480469 s

So the execution time is 400 times faster, just have to remove those records.

Anyone has any idea how to do it ? Many thanks !


Solution

  • The solution is the following portion of code :

    st = time.time()
    test1 = pd.date_range(start='2024-01-01 00:00:00', end='2024-12-31 23:00:00', freq='H', name='ts', closed=None)
    test2 = pd.DatetimeIndex(list(filter(lambda x: (np.is_busday(np.datetime64(x, 'D'))) and (x.hour != 23), test1)))
    print_statistics(f'{test2}', 'filter_bp()', 'utils.py', time.time() - st)
    

    which gives the following output :

    DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 01:00:00',
                   '2024-01-01 02:00:00', '2024-01-01 03:00:00',
                   '2024-01-01 04:00:00', '2024-01-01 05:00:00',
                   '2024-01-01 06:00:00', '2024-01-01 07:00:00',
                   '2024-01-01 08:00:00', '2024-01-01 09:00:00',
                   ...
                   '2024-12-31 13:00:00', '2024-12-31 14:00:00',
                   '2024-12-31 15:00:00', '2024-12-31 16:00:00',
                   '2024-12-31 17:00:00', '2024-12-31 18:00:00',
                   '2024-12-31 19:00:00', '2024-12-31 20:00:00',
                   '2024-12-31 21:00:00', '2024-12-31 22:00:00'],
                  dtype='datetime64[ns]', length=6026, freq=None)                                Execution time: 0.13873147964477539062 s