Search code examples
pythonpandasdataframelookup-tables

pandas dataframe as lookup table for another dataframe of a different length


I have a small pandas dataframe containing data with just a few rows and three columns:

import pandas as pd
df_size = pd.DataFrame([[0.510,0.450,0.540],   
                        [0.899,0.820,1.150],   
                        [1.745,1.587,2.020],   
                        [2.020,1.745,2.405],   
                       ], columns=['diameter_mean', 'diameter_min','diameter_max'])

a second dataframe contains a (much longer) lookup table:

df_lookup = pd.DataFrame([[0.450,0.021548],
                          [0.510,0.021791],
                          [0.540,0.022038],
                          [0.565,0.022289],
                          [0.695,0.022545],
                          [0.720,0.034321],
                          [0.770,1.292340],
                          [0.820,1.296070],
                          [0.899,1.302340],
                          [1.150,2.311770],
                          [1.361,3.325140],
                          [1.587,4.144621],
                          [1.745,3.498933],
                          [2.020,3.512665],
                          [2.405,3.610773],
                        ], columns=['diameter', 'SMS'])

Means, not for any lookup table entry there might be a data point in df_size existing.

Based on df_lookup['diameter'] I want to automatically find the according SMS-values for all three columns of df_size['diameter_mean'], df_size['diameter_min'] and df_size['diameter_max'] and append the found values as three new columns ['SMS'], ['SMS_min'], ['SMS_max'] to the dataset dataframe 'df_size'.

I tried to create the three new columns with merge, but this causes -as expected- a valueError:

df_size['SMS'] = df_size.merge(df_lookup, left_on='diameter_mean', right_on='diameter')
df_size['SMS_min'] = df_size.merge(df_lookup, left_on='diameter_min', right_on='diameter')
df_size['SMS_max'] = df_size.merge(df_lookup, left_on='diameter_max', right_on='diameter')

due to multiple columns that would be set to one for all three lines of code.

Alternatively I tried solutions with apply and with map but it seems that I miss something (here example for the diameter_mean-column only):

df_size['SMS'].apply(lambda df_lookup.SMS: df_lookup['diameter'][(df_size['diameter_mean'])].values[0])

causes a key error.

Aimed df_size would look like:

df_size
'diameter_mean' 'diameter_min' 'diameter_max' 'SMS'     'SMS_min'    'SMS_max'
0.510           0.450           0.540         0.021791  0.021548     0.022038
0.899           0.820           1.150         1.302340  1.296070     2.311770
1.745           1.587           2.020         3.498933  4.144621     3.512665
2.020           1.745           2.405         3.512665  3.498933     3.610773

Is it necessary by the way that both dataframes have strongly monotonic behavior in therms of the lookup parameter (=diameter)?


Solution

  • You could use a series of merge_asof with varying references and directions:

    tmp = df_size.reset_index()
    
    merges = {'SMS': ('diameter_mean', 'nearest'),
              'SMS_min': ('diameter_min', 'forward'),
              'SMS_max': ('diameter_max', 'backward'),
             }
    
    for k, (c, d) in merges.items():
        df_size[k] = pd.merge_asof(
                          tmp.sort_values(by=c)[['index', c]],
                          df_lookup,
                          left_on=c, right_on='diameter',
                          direction=d
                     ).set_index('index')['SMS']
    

    Output:

       diameter_mean  diameter_min  diameter_max       SMS   SMS_min   SMS_max
    0          0.510         0.450         0.540  0.021791  0.021548  0.022038
    1          0.899         0.820         1.150  1.302340  1.296070  2.311770
    2          1.745         1.587         2.020  3.498933  4.144621  3.512665
    3          2.020         1.745         2.405  3.512665  3.498933  3.610773