Search code examples
pythonpandas

Pandas Merging on Nearest Date with Consideration for Grouping


I am developing a python solution to a data transformation challenge and am experiencing a small hurdle that I can't seem to figure out. I want to merge the two data frames (i.e., df1 & df2) to create a new data frame that has inserted df2['value2'] into df1['value1'] while considering date and 'Type' alignment. Has anyone ever solved a similar problem? Is pandas the best library for solving this problem?

Please see the code snippet I am using in my jupyter notebook for solution testing below.

import pandas as pd 

df1 = pd.DataFrame({ 'Type': ['Labor', 'Material', 'Labor', 'Material' , 'Labor', 'Material', 'Labor', 'Material'],'date1': ['2021-01-01', '2021-01-01', '2021-02-01', '2021-02-01', '2021-03-01', '2021-03-01', '2021-04-01', '2021-04-01'], 'value1': [0,0,0,0,0,0,0,0]})
df2 = pd.DataFrame({ 'Type': ['Labor', 'Material', 'Labor', 'Material'],'date2': ['2021-01- 
11', '2021-02-22', '2021-02-05', '2021-03-15'], 'value2': [10,20,25,45]})

display(df1)
display(df2)

df1['date1'] = pd.to_datetime(df1['date1'])
df2['date2'] = pd.to_datetime(df2['date2'])

df1 = df1.sort_values(['date1', 'Type'])
df2 = df2.sort_values(['date2', 'Type'])

merge_df = pd.merge_asof(df2, df1, left_on= 'date2', right_on= 'date1' , by = 'Type', 
direction = 'nearest')

display(merge_df)

The output I am trying to achieve is visualized below.

Type Date Value
Labor 2021-01-01 10
Material 2021-01-01 0
Labor 2021-02-01 25
Material 2021-02-01 20
Labor 2021-03-01 0
Material 2021-03-01 45
Labor 2021-04-01 0
Material 2021-04-01 0

Solution

  • merge_asof is a left merge, you need to use df1 on the left. Also you have to use the month as by:

    out = (pd.merge_asof(df1.assign(month=df1['date1'].dt.to_period('M')),
                         df2.assign(month=df2['date2'].dt.to_period('M')),
                         left_on='date1', right_on='date2',
                         by=['Type', 'month'], direction='nearest')
             .fillna({'value2': 0})
          )
    

    Output:

           Type      date1  value1    month      date2  value2
    0     Labor 2021-01-01       0  2021-01 2021-01-11    10.0
    1  Material 2021-01-01       0  2021-01        NaT     0.0
    2     Labor 2021-02-01       0  2021-02 2021-02-05    25.0
    3  Material 2021-02-01       0  2021-02 2021-02-22    20.0
    4     Labor 2021-03-01       0  2021-03        NaT     0.0
    5  Material 2021-03-01       0  2021-03 2021-03-15    45.0
    6     Labor 2021-04-01       0  2021-04        NaT     0.0
    7  Material 2021-04-01       0  2021-04        NaT     0.0
    

    Intermediates:

    # df1.assign(month=df1['date1'].dt.to_period('M'))
           Type      date1  value1    month
    0     Labor 2021-01-01       0  2021-01
    1  Material 2021-01-01       0  2021-01
    2     Labor 2021-02-01       0  2021-02
    3  Material 2021-02-01       0  2021-02
    4     Labor 2021-03-01       0  2021-03
    5  Material 2021-03-01       0  2021-03
    6     Labor 2021-04-01       0  2021-04
    7  Material 2021-04-01       0  2021-04
    
    # df2.assign(month=df2['date2'].dt.to_period('M'))
           Type      date2  value2    month
    0     Labor 2021-01-11      10  2021-01
    2     Labor 2021-02-05      25  2021-02
    1  Material 2021-02-22      20  2021-02
    3  Material 2021-03-15      45  2021-03