I have 2 dataframes like these:
df1
Alias seq_RNA
Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC
Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT
Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT
Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA
Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA
Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG
Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA
df2
Alias MiRBase_ID
Hsa-Mir-4-P1 hsa-let-133
Hsa-Let-7-P1c hsa-let-7c
Hsa-Let-7-P1d hsa-let-7a-2
Hsa-Mir-183-P1 hsa-let-183
Hsa-Mir-122-as hsa-let-122
Hsa-Let-7-P2a3 hsa-let-7f-2
I want to generate a new df3 where if there is a substring match of one of the alias between df1 and df2 it create a new column adding the corresponding MiRBase_ID, like this:
Alias seq_RNA MiRBase_ID
Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC na
Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT na
Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT hsa-let-183
Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT na
Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA na
Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA na
Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG hsa-let-122
Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA na
So basically when it finds a match or even a partial match (but the best match possible) of one alias in df1 in df2 it takes the MiRBase_ID and put it in the new df3. otherwhise put na if no match is found. there is no order in the 2 dataframes so each of the best match needs to be searched in the whole Alias column and df1 and df2 are of different nr of rows.
Use str.extract
to get the substring and merge
:
import re
pattern = '|'.join(df2['Alias'].map(re.escape))
out = df1.merge(df2.drop(columns='Alias'),
left_on=df1['Alias'].str.extract(f'({pattern})', expand=False),
right_on=df2['Alias'], how='left').drop(columns='key_0')
Output:
Alias seq_RNA MiRBase_ID
0 Hsa-Mir-133-P2-v1-5p AGCTGGTAAAATGGAACCAAATC NaN
1 Hsa-Mir-143-P1-v2-3p TTGGTCCCCTTCAACCAGCTGT NaN
2 Hsa-Mir-183-P1-v1-3p TTTGGTCCCCTTCAACCAGCTGT hsa-let-183
3 Hsa-Mir-490-3p CAACCTGGAGGACTCCATGCTGT NaN
4 Hsa-Mir-499-5p TTAAGACTTGCAGTGATGTTTA NaN
5 Hsa-Mir-163-P2-v2-3p TTGGTCCCCTTCAACCAGCTA NaN
6 Hsa-Mir-122-as-5p TTTAGTGTGATAATGGCGTTTG hsa-let-122
7 Hsa-Mir-1-P1-5p ACATACTTCTTTATATGCCCATA NaN