Tags: pandas, performance, for-loop, join

What is an effective way of joining two tables given in different formats, using pandas?


I am working with two tables:

df1: The Zone column has, say, 3 categorical levels, whereas the Weight column is continuous.

id Zone Weight
1 A 4
2 B 14
3 C 25
4 A 8.5
5 B 10

df2: The first two columns are the lower and upper bounds of the Weight column from df1. The levels of df1's Zone column appear in df2 as separate columns, shown as Zone A, Zone B and Zone C below (plain A, B and C in the code further down).

Weight_LB Weight_UB Zone A Zone B Zone C
1 10 -2.3 3.4 -3.6
11 20 -5.3 2.8 -7.4
21 30 -6.7 -1.9 2.5

Problem Statement:

I want a final dataframe df3 which would look like:

id Zone Weight New_Col
1 A 4 -2.3
2 B 14 2.8
3 C 25 2.5
4 A 8.5 -2.3
5 B 10 3.4

Note: This is essentially a left join on Zone combined with a range condition on Weight. For example, id 2 has Zone B and Weight 14, which falls in the 11-20 band, so New_Col is 2.8. The code I am currently using is not at all efficient when df1 and df2 are very large.

I am using the following code right now:

import pandas as pd

# defining the two dataframes
df1 = pd.DataFrame({'id':[1,2,3,4,5],
                    'Zone':['A','B','C','A','B'],
                    'Weight':[4,14,25,8.5,10]})
df2 = pd.DataFrame({'Weight_LB':[1,11,21],
                    'Weight_UB':[10,20,30],
                    'A':[-2.3,-5.3,-6.7],
                    'B':[3.4,2.8,-1.9],
                    'C':[-3.6,-7.4,2.5]})

# unpivoting the df2 so that matching on conditions becomes easier
df2_melted = pd.melt(df2, id_vars=['Weight_LB','Weight_UB'], var_name = 'Zone', value_name = 'new_col')
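# df2_melted now has one row per (weight band, zone) pair -- 9 rows in total, e.g.:
#    Weight_LB  Weight_UB Zone  new_col
# 0          1         10    A     -2.3
# 1         11         20    A     -5.3
# 2         21         30    A     -6.7
# ...and similarly for zones B and C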

# matching conditions on zone and weight between df1 and df2
new_col = {}  # dictionary to store id -> new-column value pairs
for index, row in df1.iterrows():
    for i, r in df2_melted.iterrows():  # nested loops: compare every df1 row against every df2_melted row
        # match on zone and check that the weight falls inside the band
        if row['Zone'] == r['Zone'] and r['Weight_LB'] <= row['Weight'] <= r['Weight_UB']:
            new_col[row['id']] = r['new_col']  # store the matched value in the dictionary

new_col_df = pd.DataFrame(new_col.items(), columns=['id', 'new_col'])  # converting the dictionary to a dataframe

# final result df3: left join on the id column to map the new values back onto df1
df3 = pd.merge(df1,new_col_df,how='left',on='id')
df3

However, the above code does not scale: the nested iterrows() loops perform len(df1) × len(df2_melted) row comparisons, which quickly becomes very slow on large datasets.

Is there a more pythonic way to do this? Any help is appreciated! Thanks in advance...


Solution

  • As your weight bands appear to be contiguous (1-10, 11-20, 21-30 with no gaps), you can use merge_asof based on Weight_UB (or Weight_LB; it doesn't matter, you just have to change the direction). Note that merge_asof requires both frames to be sorted on the merge key and the key dtypes to match, which is why the sort_values and astype calls are needed:

    out = pd.merge_asof(
        df1.sort_values('Weight'), 
        df2_melted.sort_values('Weight_UB').astype({'Weight_UB': float}),
        left_on='Weight', right_on='Weight_UB', by='Zone', direction='forward'
    )[df1.columns.tolist() + ['new_col']]
    

    Output:

    >>> out.sort_values('id', ignore_index=True)
       id Zone  Weight  new_col
    0   1    A     4.0     -2.3
    1   2    B    14.0      2.8
    2   3    C    25.0      2.5
    3   4    A     8.5     -2.3
    4   5    B    10.0      3.4
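
    The same result can be obtained by anchoring on Weight_LB, as noted above; here is a minimal sketch under the same assumptions, reusing df1 and df2_melted from the question (out_lb is just an illustrative name):

    # sketch: match each Weight to the closest Weight_LB at or below it, within its Zone
    out_lb = pd.merge_asof(
        df1.sort_values('Weight'),
        df2_melted.sort_values('Weight_LB').astype({'Weight_LB': float}),
        left_on='Weight', right_on='Weight_LB', by='Zone', direction='backward'
    )[df1.columns.tolist() + ['new_col']]

    In either variant, a final sort_values('id', ignore_index=True) restores the original row order, because merge_asof needs both frames sorted by the weight key rather than by id.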