I have 2 data frames
print(df1)
Name df1 RT [min] Molecular Weight RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2
0 unknow compound 1 7.590 194.04212 7.790 7.390 194.24212 193.84212
1 unknow compound 2 7.510 194.15000 7.710 7.310 194.35000 193.95000
2 unknow compound 3 7.410 194.04209 7.610 7.210 194.24209 193.84209
3 unknow compound 4 7.434 342.11615 7.634 7.234 342.31615 341.91615
4 unknow compound 5 0.756 176.03128 0.956 0.556 176.23128 175.83128
and
print(df2)
Name df2 Molecular Weight RT [min]
0 β-D-Glucopyranuronic acid 194.04220 7.483
1 α,α-Trehalose 194.10000 7.350
2 Threonylserine 206.08970 8.258
3 Terephthalic acid 166.02595 7.465
4 Sulfuric acid 97.96714 8.909
I want to merge the rows from df2 to those in df1 if 2 conditions are met.
This means that if a row from df2 meets the conditions of two other rows from df1, the row of df1 will be duplicated.
so df3 should look
print(df3)
Name df1 RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2 Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.790 7.390 194.24212 193.84212 β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 1 7.790 7.390 194.24212 193.84212 α,α-Trehalose 194.1000 7.350
2 unknow compound 2 7.710 7.310 194.35000 193.95000 β-D-Glucopyranuronic acid 194.0422 7.483
3 unknow compound 3 8.310 7.910 206.30000 205.90000 Threonylserine 206.0897 8.258
4 unknow compound 4 7.634 7.234 342.31615 341.91615 NaN NaN NaN
5 unknow compound 5 0.956 0.556 176.23128 175.83128 NaN NaN NaN
the first row in df2 meets the 2 conditions of the unknow compound 1 and unknow compound 2 from df1 so we have it twice in df3.
the second row in df2 meets the 2 conditions of the unknow compound 1 only.
the third row in df2 meets the 2 conditions of the unknow compound 3 only.
all the other rows do not meet any of the conditions from df1.
I tried to do it based on How to join two data frames for which column values are within a certain range? the first answer
import pandas as pd
df_1 = pd.read_excel (r'D:\CD SandBox\df1.xlsx')
df_2 = pd.read_excel (r'D:\CD SandBox\df2.xlsx')
df2.index = pd.IntervalIndex.from_arrays(df2['RT [min]-0.2'],df2['RT [min]+0.2'],closed='both')
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
But do not know what to do with the second line of code and receive this error:
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\series.py", line 4213, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
File "<input>", line 1, in <lambda>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\indexes\interval.py", line 730, in get_loc
raise KeyError(key)
KeyError: 8.258
EDIT: Trying using merge_asof
according to How to join two DataFrames with multiple overlapping timestamps using an extra shared variable
df2 = df2.drop(['RT [min]', 'Molecular Weight'], axis=1)
df2['RT [min]']=df2['RT [min]-0.2']
pd.merge_asof(df2[['RT [min]','Name df2']] , df1,on='RT [min]',direction ='forward',allow_exact_matches =True)
...
RT [min] Name df2 Name df1 Molecular Weight
0 0.556 unknow compound 5 α,α-Trehalose 194.10000
1 7.210 unknow compound 3 α,α-Trehalose 194.10000
2 7.234 unknow compound 4 α,α-Trehalose 194.10000
3 7.310 unknow compound 2 α,α-Trehalose 194.10000
4 7.390 unknow compound 1 Terephthalic acid 166.02595
gives wrong matching for the table.
Any idea\ hint will be appreciated
If you're using pandas 1.2.0 you can create the cartesian product of both dataframes and then check the conditions. Also, as you don't need RT [min]
and Molecular Weight
from df1
, I'll assume you already removed them:
df3 = df1.merge(df2, how = 'cross', suffixes = [None,None])
#check if 'Molecular Weight' is in the interval:
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])
#check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])
df3 = df3[mask1 & mask2].reset_index(drop = True)
Output:
df3
Name df1 RT [min]+0.2 RT [min]-0.2 ... Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.79 7.39 ... β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 2 7.71 7.31 ... β-D-Glucopyranuronic acid 194.0422 7.483
2 unknow compound 2 7.71 7.31 ... α,α-Trehalose 194.1000 7.350
3 unknow compound 3 7.61 7.21 ... β-D-Glucopyranuronic acid 194.0422 7.483
4 unknow compound 3 7.61 7.21 ... α,α-Trehalose 194.1000 7.350
As your data is considerably large, may you like to use a generator in order to don't load the whole resulting dataframe. Again, I'm assuming you removed RT [min]
and Molecular Weight
from df1
.
import numpy as np
from itertools import product
def df_iter(df1,df2):
for row1, row2 in product(df1.values, df2.values):
# RT [min]-0.2 <= RT [min] <= RT [min]+0.2
if row1[2] <= row2[2] <= row1[1]:
#Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
if row1[4] <= row2[1] <= row1[3]:
yield np.concatenate((row1,row2))
df3_rows = df_iter(df1,df2)
Then you can manipulate the rows:
for row in df3_rows:
print(row)
Output:
['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]
Or create a dataframe:
df3 = pd.DataFrame(data = list(df3_rows),
columns = np.concatenate((df1.columns, df2.columns)))
Which results in the same dataframe from Option 1.
NOTE1: Be careful with the indices in the conditionals from function df_iter
, those work in my case.
NOTE2: I'm pretty sure your data doesn't match with the example df3
.