Search code examples
pythonpandasdataframeoptimizationvectorization

How to optimise an inequality join in Pandas?


I have a cross-join-like operation that I have implemented using a for loop. I need to make it fast and preferably elegant. It creates a block entry per day with a date range condition.

This works fine for small datasets but completely stalls into a very slow runtime for larger datasets. I know that it can be vectorized. My implementation is very bad.

I have looked at the other posts on how to vectorize loops in DataFrames. I read 10 minutes to pandas as per suggested by this post How to iterate over rows in a DataFrame in Pandas, tried using lambda functions. Messed with Cython. I just can't get it.

I tried implementing [pandas.MultiIndex.to_frame] and I have a strong feeling this, or one of it's cousins, is a good way to go. I have also tried a bunch of other things and nothing.

I want to learn to code elegantly. All suggestions, variations on the solution and, comments are welcome.

from datetime import datetime
import pandas as pd 

beginning = pd.to_datetime('14/09/2021', dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()

frame = pd.DataFrame(columns=['Record_Date', 'Identifier', 'start_date', 'end_date', 'color'])
block = pd.DataFrame(
    {'Identifier': ['4913151F', 'F4E9124A', '31715888', 'D0C57FCA', '57B4D7EB', 'E46F1E5D', '99E0A2F8', 'D77E342E',
                    'C596D233', 'D0EED63F', 'D0C57FCA'],
     'start_date': ['03/11/2020', '05/07/2022', '22/12/2016', '17/03/2024', '14/10/2022', '08/08/2022', '04/11/2020',
                    '13/03/2023', '05/11/2021', '12/27/2022', '13/06/2022'],
     'end_date': ['11/07/2023', '11/04/2023', '14/12/2018', '20/01/2025', '15/06/2023', '09/01/2023', '16/07/2022',
                  '19/05/2024', '24/09/2022', '17/11/2023', '13/06/2023'],
     'color': ['red', 'green', 'magenta', 'yellow', 'light_blue', 'dark_blue', 'black', 'white', 'pink', 'orange',
               'yellow']})

block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format='mixed')
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format='mixed')
block_uniques = block.drop_duplicates(['Identifier', 'start_date'])

for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)] 
    temp_df.insert(0, 'Record_Date', x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values(['Record_Date', 'Identifier'])
frame = frame.reset_index().drop('index', axis=1)

print(frame)

Output and solution:

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white

[2646 rows x 5 columns]


Solution

  • Looks like some form of inequality join; conditional_join offers an efficient way to handle this. Note that if your dates in block are not overlapping, then pd.IntervalIndex is suitable and performant.

    # pip install pyjanitor
    import janitor
    import pandas as pd
    
    # convert date_range to either a named series, or a dataframe
    date_range = pd.Series(date_range, name = 'date')
    
    (block
    .conditional_join(
        date_range, 
        # column from the left,
        # column from the right,
        # operator
        ('start_date', 'date', '<='), 
        ('end_date', 'date', '>='),
        # in some scenarios,
        # numba might offer a perf boost
        use_numba=False,
       )
    )
         Identifier start_date   end_date   color       date
    0      4913151F 2020-11-03 2023-07-11     red 2021-09-14
    1      4913151F 2020-11-03 2023-07-11     red 2021-09-15
    2      4913151F 2020-11-03 2023-07-11     red 2021-09-16
    3      4913151F 2020-11-03 2023-07-11     red 2021-09-17
    4      4913151F 2020-11-03 2023-07-11     red 2021-09-18
    ...         ...        ...        ...     ...        ...
    2644   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-09
    2645   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-10
    2646   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-11
    2647   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-12
    2648   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-13
    
    [2649 rows x 5 columns]