I have a small pandas dataframe containing data with just a few rows and three columns:
import pandas as pd
df_size = pd.DataFrame([[0.510,0.450,0.540],
[0.899,0.820,1.150],
[1.745,1.587,2.020],
[2.020,1.745,2.405],
], columns=['diameter_mean', 'diameter_min','diameter_max'])
a second dataframe contains a (much longer) lookup table:
df_lookup = pd.DataFrame([[0.450,0.021548],
[0.510,0.021791],
[0.540,0.022038],
[0.565,0.022289],
[0.695,0.022545],
[0.720,0.034321],
[0.770,1.292340],
[0.820,1.296070],
[0.899,1.302340],
[1.150,2.311770],
[1.361,3.325140],
[1.587,4.144621],
[1.745,3.498933],
[2.020,3.512665],
[2.405,3.610773],
], columns=['diameter', 'SMS'])
Means, not for any lookup table entry there might be a data point in df_size existing.
Based on df_lookup['diameter']
I want to automatically find the according SMS-values for all three columns of df_size['diameter_mean']
, df_size['diameter_min']
and df_size['diameter_max']
and append the found values as three new columns ['SMS'], ['SMS_min'], ['SMS_max'] to the dataset dataframe 'df_size'.
I tried to create the three new columns with merge, but this causes -as expected- a valueError:
df_size['SMS'] = df_size.merge(df_lookup, left_on='diameter_mean', right_on='diameter')
df_size['SMS_min'] = df_size.merge(df_lookup, left_on='diameter_min', right_on='diameter')
df_size['SMS_max'] = df_size.merge(df_lookup, left_on='diameter_max', right_on='diameter')
due to multiple columns that would be set to one for all three lines of code.
Alternatively I tried solutions with apply
and with map
but it seems that I miss something (here example for the diameter_mean-column only):
df_size['SMS'].apply(lambda df_lookup.SMS: df_lookup['diameter'][(df_size['diameter_mean'])].values[0])
causes a key error.
Aimed df_size would look like:
df_size
'diameter_mean' 'diameter_min' 'diameter_max' 'SMS' 'SMS_min' 'SMS_max'
0.510 0.450 0.540 0.021791 0.021548 0.022038
0.899 0.820 1.150 1.302340 1.296070 2.311770
1.745 1.587 2.020 3.498933 4.144621 3.512665
2.020 1.745 2.405 3.512665 3.498933 3.610773
Is it necessary by the way that both dataframes have strongly monotonic behavior in therms of the lookup parameter (=diameter)?
You could use a series of merge_asof
with varying references and directions:
tmp = df_size.reset_index()
merges = {'SMS': ('diameter_mean', 'nearest'),
'SMS_min': ('diameter_min', 'forward'),
'SMS_max': ('diameter_max', 'backward'),
}
for k, (c, d) in merges.items():
df_size[k] = pd.merge_asof(
tmp.sort_values(by=c)[['index', c]],
df_lookup,
left_on=c, right_on='diameter',
direction=d
).set_index('index')['SMS']
Output:
diameter_mean diameter_min diameter_max SMS SMS_min SMS_max
0 0.510 0.450 0.540 0.021791 0.021548 0.022038
1 0.899 0.820 1.150 1.302340 1.296070 2.311770
2 1.745 1.587 2.020 3.498933 4.144621 3.512665
3 2.020 1.745 2.405 3.512665 3.498933 3.610773